The exact same parameters and hash size are used for each engine, and the engines alternate between white and black each game.
So far, in 5 games, I have seen 4 draws and 1 WIN by the 2.3 engine....
I thought 2.3.2a was decisively better, am I missing something? Is there some better "benchmark" I should be trying?
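For what it's worth, a back-of-envelope calculation shows how little a 5-game sample can tell you. The win/draw probabilities below are assumptions chosen purely for illustration, not measured values:

```python
from itertools import product

# Assume (for illustration only) the two engines are EQUALLY strong,
# with a 60% draw rate per game.
prob = {1.0: 0.2, 0.5: 0.6, 0.0: 0.2}  # win / draw / loss for engine A

# Enumerate every 5-game outcome sequence and sum the probability
# that engine A scores at least 3/5 (e.g. 4 draws + 1 win, as above).
p_at_least_3 = 0.0
for outcomes in product(prob, repeat=5):
    if sum(outcomes) >= 3.0:
        p = 1.0
        for result in outcomes:
            p *= prob[result]
        p_at_least_3 += p

print(f"P(A scores >= 3/5 although the engines are equal) = {p_at_least_3:.3f}")
```

Under these assumed numbers the chance is about 36%, so a 4-draws-and-1-win result is entirely compatible with the two versions being equal in strength.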
Rybka version X vs Rybka version Y games are not very good indicators. You should run multiple tournaments of the Rybkas against many other strong engines.
I do not see the point of comparing Rybka 2.3 to others, because it is claimed to be the strongest... but I am sure that depends on all the settings involved for any engine.
I will just try 2.3.2a out there in "the field" and see what happens. I guess I was thinking that 2.3.2a would beat anything at all times, aside from any operator errors.....
just what does "the strongest" mean?..if you don't win in the game you are playing, all the statistics for 10000 other games won't matter
When you ask, "which is the strongest engine", the only objective statement is which engine, if you were to play a huge number of games, would have the highest rating. Rybka 2.3.2a will occasionally lose to Fritz 5.32 because there are some small things in each engine that cause Fritz 5.32 to be stronger after the opening--this might happen only once out of 50 or once out of 100 games--but it will always happen. When I was at my peak as a human player some years ago (probably somewhat over 2200 FIDE, though now I'd be very lucky to be over 2000), you would expect that, out of 1000 games, I would get possibly a few wins (and more draws) against Kasparov--that's just the nature of things--but he would still beat me in the nine hundred and something other games. I certainly hope you would conclude that Kasparov is the stronger player, even though he would not beat me in every game.
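The "out of 1000 games" intuition matches the standard Elo expected-score formula. A small sketch (the 2850/2200 ratings are illustrative assumptions, not figures from the post):

```python
def expected_score(r_a, r_b):
    """Standard logistic Elo expected score of player A vs player B."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

# Illustrative ratings only: a ~2200 master against a ~2850 Kasparov.
e = expected_score(2200, 2850)
print(f"Expected score per game: {e:.4f}")             # about 0.023
print(f"Expected points in 1000 games: {1000 * e:.0f}")  # about 23
```

Around 23 points out of 1000 is exactly "a few wins and more draws" -- the stronger player still wins the overwhelming majority.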
It quite often happens that I emerge from the opening phase with a better score NOT using the book than using the book. However, in short games that costs a big price in opening time and in search depth on subsequent moves. Using the book gets me out of the opening phase much quicker, but with a roughly neutral score, which hopefully will improve later on. Still, there is nothing more boring than the next 10 moves at +/- 0.02 or so.
Further complicating things is that many of those opening book moves, after being studied for many years by hand, in my opinion have a longer-range perspective over many moves (20 or more) than a chess engine in a shorter game calculating the best possible score over only 6 or 7 moves. So in a game using an opening book, one might hope that the benefits of the book will pay off later in the game. But that is just speculation on my part. I think I do better overall using the opening book, as I have been nailed enough times in the middlegame by some crummy "search depth 8 or 9" moves. That happens when the computer is pressed for time.
One technical question, if you don't mind: how do engines calculate the score showing how far one side is ahead or behind? That seems to differ dramatically from engine to engine. When 2 engines are calculating a score for the same move, why is there such a big difference? Sometimes the white engine thinks that it is ahead while the black engine thinks that it is ahead (both using absolute scores)! Usually both engines agree on which side is ahead, but not always.
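One part of the answer: those scores are heuristic estimates, not ground truth. Each engine sums many hand-tuned terms (material, mobility, king safety, pawn structure, ...) whose weights differ between engines, which is why two engines can honestly disagree about the same position. A minimal sketch of just the material term (the piece values are common conventions, not any particular engine's):

```python
# Toy static evaluation: count material from the piece-placement field
# of a FEN string. Real engines add many more weighted heuristic terms,
# and those weights differ from engine to engine.
PIECE_VALUES = {"p": 100, "n": 320, "b": 330, "r": 500, "q": 900, "k": 0}

def material_eval(fen):
    """Return a score in centipawns from White's point of view."""
    placement = fen.split()[0]
    score = 0
    for ch in placement:
        if ch.lower() in PIECE_VALUES:
            value = PIECE_VALUES[ch.lower()]
            score += value if ch.isupper() else -value
    return score

# Starting position: material is balanced.
start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
print(material_eval(start))     # 0
# Same position with Black's queen removed: White is up ~9 pawns.
no_queen = "rnb1kbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
print(material_eval(no_queen))  # 900
```

Swap the piece values (some engines use 325 for a knight, 975 for a queen, and so on) and the same position already gets a different score -- before positional terms even enter the picture.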
I was very curious about this, so I based my test on the 2.2 mp, 2.3.2 mp and 2.3.2a mp versions. I found that the "mp" versions seem better than the others, and 2.2 mp was my best before!
These were the results (based on many matches and tournaments - 350 games in all - at 30'+0 and 5'+2):
P4 3.2 GHz / 2 GB RAM / 256 MB hash - Fritz 10 opening book and GUI - no tablebases
v2.3.2 mp is always the strongest (in both matches and tournaments) - it always takes first place.
v2.3.2a had some difficulty winning against the two other engines.
My conclusion:
If you don't have a Core 2 Duo or Core 2 Quad PC, v2.3.2 mp seems better for you.
v2.3.2a is the strongest of all, but you must use it on a Core 2 Duo or Core 2 Quad (or AMD X2) PC
if you want its full performance.
But I really confirm that 2.3.2a mp is my best now!!
I think that will help you.
PS. When you do an engine match, use the same (good) opening with reversed colors! With this, I think 100 different games will be sufficient.
In my opinion, when you take part in a tournament, you won't have 1000 games there to judge which engine is the best. So if you are really the best, you must be able to solve many positions, not just one or two.
I accept that Rybka, even the latest version, like other engines, is not able to solve some positions, but if you choose a good opening book like Fritz 10 or Rybka II for your test, a hundred games will show you 90% success!
This leaves 2 possibilities, disregarding the learn features existing in some engines and assuming identical conditions:
1) Each engine plays each other engine twice, one game with each color; even with a huge number of engines, the result will still be a small sample of games.
2) Let each engine play with its own book containing all possible moves for, say, moves 1 and 2 (black and white); of course no human preferences are applied and no statistics are present - let the engines mark the moves as they play.
Approach 2) is not "no-book", but it is the closest you can get when you want a larger sample of games while trying to avoid too many duplicates.
Then take it a step further: make new books, put in all possible moves for black and white up to move 20... let a bunch of engines play a few hundred thousand games at tournament TC, and presto, each engine has now formed its own repertoire, ready to challenge the known GM theory... ;) *cough*
>It would be interesting to test engine strength with no opening book.
That is possible with test-set books. E.g. Nunn Testsets 2 is a collection of 25 different openings which are considered to be almost equal for black and white (although I do not agree with that 100%). To eliminate any white/black advantage you just let the engines play each position twice with alternating colors. Then you get 50 games without book influence.
Test sets are good when you want a quick peek into an engine's way of handling some preselected positions, and you may get a good idea of the way engines handle certain types of position. But - these positions are still chosen by somebody.
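The alternating-color scheme described above can be sketched like this (the opening labels and engine names are placeholders, not the actual Nunn Testsets 2 contents):

```python
# Build a color-reversed schedule from a fixed set of 25 starting
# positions: each position is played twice with colors swapped, so any
# built-in advantage of a position cancels out across the pair.
openings = [f"opening_{i:02d}" for i in range(1, 26)]
engines = ("EngineA", "EngineB")

schedule = []
for pos in openings:
    schedule.append({"start": pos, "white": engines[0], "black": engines[1]})
    schedule.append({"start": pos, "white": engines[1], "black": engines[0]})

print(len(schedule))  # 50 games, as described above
```

Each engine gets white exactly 25 times, so color bias is removed by construction rather than by averaging over many random pairings.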
>When you ask, "which is the strongest engine", the only objective statement is which engine, if you were to play a huge amount of games, would have the highest rating.
Why is this the only objective one?
It is subjective too.
Perhaps he defines the strongest engine between A and B as the one that wins ALL games they play each other....
That is a perfectly normal and correct definition.
Your definition is much more useful, but we can't say it is more correct or objective....
That definition has the serious flaw that no one can be stronger than a randomly playing monkey. Your claim that the definition is "perfectly normal and correct" is incorrect, since it follows from the definition that Kasparov is not stronger than a randomly playing novice. There is a probability that a monkey typing randomly on a typewriter will reproduce the complete works of Shakespeare, and there is a probability that a novice playing random moves will beat Kasparov in a game. Theoretically, your definition is absurd.
On the other hand, to claim that the only objective definition of which of two chess-playing units is strongest is which one would have the highest rating (if they played a huge number of games) is of course an overstatement. For a start, there are slightly different notions of rating, and it has been argued that Professor Elo's (which is based on the normal distribution) is not the most accurate. It might also be argued that a program that never makes any mistakes and always finds a forced win (plays perfectly) is stronger than a program that makes mistakes and does not always find the correct moves.
There is no doubt that chess-playing units that play "perfectly" can vary hugely in how well they score against humans. If the "perfect" playing unit plays ridiculous moves (except that the moves are not mistakes), I suppose even a beginner might be able to achieve a draw in most games, and the perfect chess-playing unit would have a rating very similar to that of the opponents it happens to have played.
>that is why I keep including the phrase "after a huge number of games"--that point is essential,
That point is a no-point. "huge number of games" is indefinite.....
You should give a value to "huge"....
The point is that if a perfect program wants to "help" the opponent as much as possible, I think any moderately decent player will lose only with very small probability (when, for example, he happens to make a clear blunder). My point was - and I agree the thought experiment is rather extreme - that perfect play (in the game-theoretic sense) does NOT in general imply a high rating. In the extreme case, the rating of a player might (within a certain range) simply reflect the rating of his opponents.
Though it is rare, I have seen examples of human players (drawing specialists) with a similar tendency. I knew a quite weak player who nevertheless would score a very high percentage of draws against players rated 1500, but almost the same high percentage of draws against players rated 1900. So in the range 1500-1900 his rating was probably not really well defined, since it depended too much on the rating of the opponents he happened to be playing.
For strong chess programs, I think one should be prepared to see a similar tendency (especially as some programs might become drawing specialists). These programs might have a rating similar to their opponents' anywhere in the range 2900-3300, since the programs are very hard to beat, but on the other hand are not that good at drumming up complications they can exploit.
Have a look at drawn positions in tablebase situations. For example, a drawn position in Rook+Bishop against Rook, or even more striking, Rook+Knight against two Knights.
If you have Rook and Bishop (or Rook and Knight) against Rook (or two Knights) in a drawn position, you have a very easy game with zero danger of losing. What you seem to be saying is that maybe there will be positions where only one move draws, but I hope you can see that this concern is absurd, at least for the side with the advantage in the two endgames above. Don't tell me you are afraid of losing with Rook+Bishop against Rook!
Consider the Rook+Knight endgame against two Knights, and assume in this example that you have the two Knights. Assume that the two armies are far apart: your 3 pieces are on one side of the board while the computer/tb has its 3 pieces on the other side. Then (playing around a bit with the tablebase) you will notice that the position (according to the tb) is "clearly" drawn. You as a defender can make any reasonable move and it will remain a draw. If the attacker with the Rook+Knight just moves around near the border, the game stays drawn almost independently of how you play (this represents cooperative behavior from the tb). If on the other hand the tb tries to make progress by centralizing the pieces etc., you will soon arrive in a position where it becomes completely impossible for any human player to defend. What you will find (try this) is that one reasonable move might lose in 205 moves, another in 186 moves, while a third move keeps the game drawn. Even if you manage to play the move that holds the draw, on your next move you often face a similarly hard choice. What such an experiment shows is that a tb playing good, strong moves in drawn positions can in some cases produce a VERY, VERY strong player, while a tb playing deliberately weak moves in drawn positions leads to a player with a rating more or less identical to the opponent's. My original claim (which I think is highly plausible) is that this phenomenon also applies to the 32-piece tb.
P.s. Sorry for the long response, but since my original, highly plausible view was challenged, I decided to elaborate on it...
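The "only one reasonable move holds the draw" scenario above can be put in rough numbers. Assuming (purely for illustration) that at each of n critical positions the defender sees k plausible moves, exactly one of which holds the draw, and chooses among them at random:

```python
# Survival probability for a defender who must find the single drawing
# move among k plausible candidates, n times in a row. The k and n
# values below are illustrative assumptions, not tablebase statistics.
def survival_probability(k, n):
    return (1.0 / k) ** n

for k, n in [(2, 10), (3, 20), (5, 30)]:
    print(f"k={k} choices over n={n} moves: "
          f"P(holding the draw) = {survival_probability(k, n):.2e}")
```

Even with only 2 candidate moves per position, ten consecutive critical positions leave the random defender with under a 0.1% chance -- which is why "drawn according to the tablebase" can still be hopeless for a human against an attacker that maximizes difficulty.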
1. d4, d5 2.c4, e6 3. Nc3, Nf6 4.Bg5, Be7 5.Bc1 (??), Bf8 (!) 6. Bg5, Be7 7. Bc1 (??), Bf8 8.Bg5 draw
White might, strictly speaking, not have made any mistakes (since the resulting position after 5.Bc1 is still drawn), but please do not tell me it requires much playing strength to handle the black pieces.
The example is of course absurdly irrelevant to practical chess and chess programming; however, my point is that "psychological" elements play an important role in high-level chess programming, and part of the art is to drum up complications (e.g. unbalanced positions) that increase the chances of success. My examples are extreme cases, but I think less extreme examples are relevant.
For more relevant examples, see my review in ChessBase. I was told that the article I wrote at http://www.chessbase.com/newsdetail.asp?newsid=3465 was highly praised by Kasparov and that Kasparov had very similar views. In fact, I was told that he thought ChessBase should have given my review an even higher profile.
Unless we allow for the two players to work together to produce a draw by threefold repetition. I knew that the discussion would eventually come to this, and I didn't address it earlier because this is something different altogether--this is a situation in which a technicality in the rules is what fixes the game result, a technicality that in many situations, including the example you gave, really has nothing to do with the actual game of chess (though there are obviously situations that we see all the time in which it is best for BOTH sides to repeat). You might just as well have said, 1.Nf3 Nf6 2.Ng1 Ng8 3.Nf3 Nf6 etc. I think the discussion is really only worthwhile if we make the provision that the position must be "played out" to where technicalities such as intentional threefold repetition aren't allowed unless it happens to be the one move that holds the draw. You might think this sounds silly, but we are talking about something very, very different if we allow for intentional threefold repetition.
I think that referring to your article is kind of changing the subject somewhat, but I will comment that as soon as I hit the link, I remembered the article, remembering that I was glad even at the time you wrote it and that it came as a necessary rebuttal to the ridiculous claims made by the Crafty analysis group--my first comment after reading that original article was that all they'd done was find which grandmasters play most similarly to a non-tactical third-tier chess engine. In reality, I think that such a method might be possible with a much stronger engine (such as Rybka or even Shredder or Fritz or, better yet, all three combined) if retrograde analysis of the game is first performed and stored before going back through the game. However, this by itself doesn't overcome the problem that you note having to do with psychological elements, such as playing moves that you happen to know will cause difficulty for that particular player, but might not cause difficulty for some other player. Thus, in addition, in the retrograde analysis, one would have to set some sort of lower limit for recording the evaluation difference between the "best" move and the text move--perhaps at least 0.25 pawns, which is often used as a criterion for the difference between best and second-best moves to determine the probability that someone cheated by using a particular computer program in a game. This, or perhaps double the amount, would overcome most (but not all) psychological problems. Of course, there still remains the problem that some players are far more tactical than others, and so Fischer and Kasparov would still probably be graded down just because of their styles.
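The filtered move-matching idea above (only counting positions where the engine's best move is clearly better than the second best) might look roughly like this; the function name, data layout and sample numbers are all made up for illustration:

```python
# Grade a player's moves against engine analysis, but only in positions
# where the best move beats the second-best by at least `threshold`
# centipawns -- so forced or near-equal choices don't inflate the rate.
def match_rate(moves, threshold=25):
    """moves: list of (best_eval, second_best_eval, played_best) tuples."""
    counted = [(b, s, hit) for b, s, hit in moves if b - s >= threshold]
    if not counted:
        return None  # no decisive positions to grade
    return sum(hit for _, _, hit in counted) / len(counted)

sample = [
    (40, 35, True),     # near-equal choice: ignored
    (120, 60, True),    # clear best move, found
    (80, 20, False),    # clear best move, missed
    (300, 280, False),  # near-equal choice: ignored
]
print(match_rate(sample))  # 0.5
```

The threshold (25 centipawns here, i.e. the 0.25-pawn criterion mentioned above) is the knob: raise it and only genuinely critical decisions are graded.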
It is the nature of the discussion that none of us can "prove" we are right; however, I still think it is intuitively clear that even a weak player should easily be able to hold a draw against a benevolent 32tb player. And I still think it is intuitively clear that the rating of a benevolent 32tb player will essentially be the average of his opponents' within quite a large range (which was essentially my original claim, which was then challenged).
Maybe you could summarize your view and we can then move on....
However, I see a possible counterargument here, but I think it's unclear that it would work--that being that, in its benevolence, we make it so that the 32TB player not only refuses to move past the fourth rank, but plays moves that, as much as possible, discourage any type of breakthrough by the other player. Even with this, however, it seems unclear that a 1500-rated player who is trying to win wouldn't lose. However, I'm starting to see that the lowest Elo rating that a benevolent 32-tablebase player could have might be lower than I would have thought. There is also possibly an isolated rating range in which the 32-tablebase player would almost certainly have the same rating as his opponents, that being somewhere higher than the randomly playing monkey, but somewhat lower than a 1500-rated player--the player I'm referring to knows how to move the pieces and how to avoid direct blunders, but doesn't know how to formulate a winning plan--such a player might have an Elo in the realm of around 800 or so.
>If you have Rook and Bishop (or Rook and Knight) against Rook (or two Knights) in a drawn position, you have a very easy game with zero danger of losing. What you seem to be saying is that maybe there will be positions where only one move draws, but I hope you can see that this concern is absurd, at least for the side with the advantage in the two endgames above. Don't tell me you are afraid of losing with Rook+Bishop against Rook!
Yes, but in most cases endgames include pawns, and there the loss of a single tempo is usually critical. So of the perhaps 50 moves one can play, the 1500 Elo player would see 5-10 as playable, and of these only 1-2 will keep the draw. He might find the best one once, twice, even three times, but I don't think you could expect him to find it for 20-30 moves in a row.....
Even worse, of course, is that the position will almost surely become lost for the 1500 player during the middlegame or even the opening.
>A 32 tablebase program that always find the fastest win if a win exist, will never lose a game.
This is not true. Assuming you mean this tablebase program plays perfect moves in drawn positions too, and does not play bad moves converting drawn positions into losing ones, then:
If a win for White exists, for example, and the program is playing the black side, there is always some probability that the white side finds all the best moves and wins against this perfect 32-tablebase program.
>if the program deliberately plays ridiculous moves (always keeping a drawn position drawn)
I don't understand this.
If you have a drawn position, then if you play a move that keeps the draw, why do you call it ridiculous move?
I would call ridiculous a move, when you have a drawn position and you play a move that loses.
>However, my point is that if the program - as a thought experiment - deliberately plays ridiculous moves (always keeping a drawn position drawn), a 2000 player will have no problem drawing the game with virtual certainty. If the program plays white, it might open with 1.a4 (assuming this holds), then a reasonable player might answer 1...e5, after which the program might play 2.Ra2 (which might theoretically still be a draw).
I don't agree with this either, since after 1.a4 (a draw, we assume) 1...e5, there might be many, many drawing moves that would give the black side much more difficulty than 2.Ra2, which is not so complicated to play against.
And after 2.e4 (a hypothetical drawing move too), for example, black could have only 1-3 choices that draw, and much more difficult ones, since 2.e4 gives him more complications and trouble than Ra2. Imagine now that black has to hold on for many, many such moves and play correctly each time to keep the draw. I don't think a 2000 Elo player can do it....
>No one can prevent you in making our own private definitions, but please do not expect us to use them.
I did not say you should use them. I said that there are other definitions for the sentence "stronger Chess engine"......
>Your alternative "definition" that A is stronger than B only if A always wins against B has the serious flaw that no one can be stronger than a randomly playing monkey.
Even if that were true, it doesn't matter and does not mean the definition has a flaw. A definition of that type can't have a flaw.
>You claim that the definition is "pretty normal and correct" is incorrect since it follows from the definition that Kasparov is not stronger than a randomly playing novice.
This is not correct. If Kasparov plays 10 games against a randomly playing novice he would definitely win all 10 games. So according to the aforementioned definition Kasparov is stronger than the randomly playing novice......
>Theoretically your definition is absurd.
Silly yes, with no real meaning yes, without any usefulness yes, but not wrong. A definition of that type can't be wrong or correct. It's just a definition.....
The solution is simple: don't rate draws!
> just what does "the strongest" mean?..if you don't win in the game you are playing, all the statistics for 10000 other games won't matter
That one engine A is stronger than another engine B can mean:
A will score an average of 60% of all points.
You can only detect this with satisfactory probability if you play many games.
Note that by that metric, unless you're playing exactly the same engine against itself, one side will almost guaranteed be stronger than the other. However, at that point, the next question becomes "by how much"... which leads to the Elo rating system. :-)
/* Steinar */
Even if an engine A is stronger than another engine B,
and even if this strength difference leads to an expected score of 60% for A,
you will have a rather large chance of getting the opposite result if you have played only a few games.
In other words:
If you have played only a few games with the result "B is stronger than A, it won 60%", you must be aware that this result could occur with rather large probability, even if in fact "A is stronger than B and will win 60% in the long run".
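This is easy to quantify. A sketch assuming, as above, a true 60% expectancy for A and (to keep the arithmetic simple, an assumption on my part) that every game is decisive:

```python
from math import comb

# Probability that a truly-60% engine scores 50% or less over n
# decisive games, i.e. that the short match points the wrong way.
def prob_misleading(n, p=0.6):
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1))

for n in (10, 50, 200):
    print(f"{n:4d} games: P(result suggests B >= A) = {prob_misleading(n):.3f}")
```

With 10 games the weaker engine "holds" the match more than a third of the time; only by a couple hundred games does the misleading-result probability become small. Draws would stretch this out further, since they carry no information about who is stronger.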
>just what does "the strongest" mean?..if you don't win in the game you are playing, all the statistics for 10000 other games won't matter
It would matter if you use the current Elo method for measuring performance, which gives an indication of the strength of a program.....
If you have a box with 1000 balls, 999 white and 1 black, then you can't expect to pick a white one every time you draw a ball. Sometimes the black one will be chosen.....
The same occurs with Rybka 2.3.2a and Fritz 7.
Just try 5+0 games (5 minutes without increment). After 100 games you will normally get a first tendency, but you will of course need infinitely many games to get an exact result.
That's why there are the testing groups, for example this rating list has a good number of games (~1300 for the latest version, but only against other engines):
you can find more links to rating lists on the front page of www.rybkachess.com (under independent testing)
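Roughly how wide is that "first tendency" after 100 games? A sketch using the usual logistic Elo formula and a normal approximation; the 55% score and 50% draw rate below are assumed values, not measurements:

```python
import math

def elo_diff(score):
    """Logistic Elo difference implied by an average score (0 < score < 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

n = 100
score = 0.55      # assumed observed score of engine A
draw_rate = 0.5   # assumed draw rate; draws shrink the per-game variance
# Per-game variance of the points random variable (values 1, 0.5, 0):
p_win = score - draw_rate / 2
var = p_win * 1.0 + draw_rate * 0.25 - score ** 2
se = math.sqrt(var / n)
lo, hi = score - 1.96 * se, score + 1.96 * se
print(f"95% CI for the score: [{lo:.3f}, {hi:.3f}]")
print(f"Elo difference: {elo_diff(score):+.0f} "
      f"(roughly {elo_diff(lo):+.0f} to {elo_diff(hi):+.0f})")
```

Under these assumptions a 55% score over 100 games suggests about +35 Elo, but the 95% interval still spans from slightly negative to around +85 -- a tendency, not a verdict, which is why the big rating lists run a thousand games or more.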
Powered by mwForum 2.27.4 © 1999-2012 Markus Wichitill