a good match for Naum 3 here:
[code]CEGT Quad tournament time control 2008
1 Rybka 2.3.2a 64 4CPU 1½½½½1½½½½11½½½½½½01½101½½½½1½10½1½½1½½½½0½½½½½½½½ 28.5/50
2 Naum 3 x64 4CPU 0½½½½0½½½½00½½½½½½10½010½½½½0½01½0½½0½½½½1½½½½½½½½ 21.5/50[/code]
43% and best result from all engines so far against Rybka, a tiny bit better than Zappa Mexico II in the match by Clemens. Naum 2.2 got 30% against Rybka.
For comparison more results from Naum 3 up to now.
[code]CEGT Quad tournament time control 2008
1 Naum 3 x64 4CPU ½½1½½½½½1½½½½½1½1½½10½11½010½½0½½½0½½11½0½½1½½½1½½ 28.0/50
2 Zappa Mexico X64 4CPU ½½0½½½½½0½½½½½0½0½½01½00½101½½1½½½1½½00½1½½0½½½0½½ 22.0/50[/code]
Another fine result against the first Zappa Mexico version, as Naum 2.2 got 48% against it. For Naum 2.2 against Zappa and others compare here:
The finished matches can be replayed and downloaded here:
Overall download update on Sunday.
Naum 3 fought back against Deep Shredder 11:
[code]CEGT Quad tournament time control 2008
1 Naum 3 x64 4CPU 0½½1½10½½½½00½½010101½11½½1111½0 17.5/32
2 Deep Shredder 11 x64 4CPU 1½½0½01½½½½11½½101010½00½½0000½1 14.5/32[/code]
Naum 2.2 got 50% against DS 11.
This match will be finished on Sunday morning with rating list update. After 132 games we have the ELO improvement Alex initially estimated....and of course huge error bars.
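To put a number on those "huge error bars", here is a minimal sketch that turns one of the result strings above into an Elo difference with a rough 95% interval. It assumes the standard logistic Elo formula and a simple normal approximation on the per-game score; it is not the method the CEGT list itself uses, and the function names are made up for illustration.

```python
import math

def score_stats(results: str):
    """Return (points, games, per-game standard deviation) for a string of 1, 0 and ½."""
    values = {"1": 1.0, "0": 0.0, "½": 0.5}
    points = [values[ch] for ch in results]
    n = len(points)
    mean = sum(points) / n
    variance = sum((p - mean) ** 2 for p in points) / n
    return sum(points), n, math.sqrt(variance)

def elo_diff(score_fraction: float) -> float:
    """Rating difference implied by an expected score (logistic Elo model)."""
    return -400.0 * math.log10(1.0 / score_fraction - 1.0)

# Naum 3 against Deep Shredder 11, as posted above.
naum_vs_shredder = "0½½1½10½½½½00½½010101½11½½1111½0"
total, games, sd = score_stats(naum_vs_shredder)
p = total / games
margin = 1.96 * sd / math.sqrt(games)  # approximate 95% interval on the score fraction
low, high = elo_diff(p - margin), elo_diff(p + margin)
print(f"{total}/{games} = {p:.1%}, Elo {elo_diff(p):+.0f} (roughly {low:+.0f} to {high:+.0f})")
```

With only 32 games the interval comfortably includes zero, which is the point about waiting for more games.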
Anyway a fine start in all matches. Next opponents will be Hiarcs 11.2, Deep Fritz 10.1 and Loop M1-P.
I recently started testing on 1 CPU and have noticed something similar--about 60-70% draws. This is astoundingly high. It's interesting, though in retrospect, I'm not so happy that I bought the engine...
you will soon see that Naum is not closer to Rybka regarding choice of moves than other top engines including Hiarcs.
Here is a copy from another post, where I already announced that I am having some Linares games analyzed by five top engines. That means every single move of the GM games is analyzed by all of them, including the full analysis line, with 3 minutes for each move.
Hi Per ,
thanks, it really looks as if we will have a new second-best engine, and I doubt that Zappa Mexico II will be better than Naum 3.
Toga is at the top of the waiting list, and we are also waiting for Fruit mp, which has not been announced so far, as far as I know.
Naum 3 will be finished in two weeks and Toga could start then, if no new Rybka or Deep Fritz is released before that, which I do not expect. If it happens, two engines can also be tested at the same time, which would take six weeks instead of three.
It could also be Toga's turn, because Zappa Mexico II only shows the minor improvements expected by Anthony, and the new Zappa finished the 2nd CEGT Quad Marathon Championship 40/400 after 42 games with a disappointing result. Werner will give links to this one with the rating update on Sunday, once I have posted in the CEGT and Rybka forums, which will probably happen tomorrow. I will also give remarks about openings and novelties in certain lines, based on "The Week in Chess" and other sources from which I collected a top database. The 3rd Quad Marathon match, Rybka 2.3.2a against Naum 3, has started. The first games will also be available on Sunday. The change here is that I will play with selected sets from Harry's new book, giving priority in this way to lines en vogue and to a frequency of variations matching current GM practice. I am doing this in order to see more novelties. But classic variations from the past century, when top players were not yet influenced by the strong engines, will also be included.
Seeing from your efforts with positional and gambit rating lists that a lot can still be done: this weekend you will also get games from the just-finished Linares tournament, played by 2700+ Elo players. They are fully analyzed by the mp versions of the top engines Rybka 2.3.2a, Naum 3, Zappa Mexico II, Deep Shredder 11 and Hiarcs 11.2, with 3 minutes per move for each engine consecutively. I will also give statistical data on this. Commenting a game this way takes more than 24 hours on average, but who cares.
Still Rybka is beating Naum 3 +12−1=13
With 4 CPUs the result is more balanced (+7 −2 =23 for Rybka).
Maybe with 4 CPUs and 32 bits Naum has better chances, but testers do not test that way.
I was not trying to underestimate Naum, just that although the CCRL statistics (Pond hit and Eval diff) show that Naum and Fritz are getting similar to Rybka, Rybka is still better.
A. Naumov already announced it months before the release
CCRL plays only ponder off games, so we better think about what this ponder hit stats really are.
I think it is better like I am doing now to let engines analyse complete top GM games with same long time (3 minutes per move) and also see the complete main line given. I will give exact amount of hits for same moves, but can already tell you that main lines are mostly different, unlike what I saw when comparing Strelka and Rybka 1.0 beta some time ago, when almost all main lines were the same.
Hopefully we will not see that Anand, Kramnik and others are already Rybka clones :-).
Warm regards
Well, I am a bit lost now: what do those CCRL statistics show, then? And what about the change in the numbers from Naum 2.2 to Naum 3 and from Fritz 10 to Fritz 11?
Notice that I did not use the word clone ;)
That test about analyzing Morelia-Linares with different engines is very interesting, waiting for the results ;)
Regards, Heinz
yes, it seems a bit strange to have ponder stats when no ponder-on games are played the way the Swedes play them. As far as I know they take the games with comments and check the main lines to see which move is expected, that is, one move earlier. Real ponder on would mean seeing that an engine is already on ply 18, for example, when it is its turn again in a match where two computers are linked with a cable.
To show what I am doing, take a look here at a first completed game, from the Linares tournament that just finished, in which Anand lost with the White pieces against Aronian in the Marshall Attack. There are more games in progress and I will give an update on Sunday or Monday. To draw conclusions and gain interesting insight, we will need many dozens of these high-end games, and all kinds of games: positional, tactical, endgames and so on. If you would like a particular game analyzed, just post it and I will add it. It might also be a classical one with top players. The test is done with the complete analysis feature in the Zappa 64-bit GUI, where one engine after the other calculates the same position (not all at once). The machine is a dual-core AMD X64 4200+ with 1024 MB hash and 3 minutes per engine for each ply, and the engines are 64-bit versions. This is because the Quad machines are busy with the CEGT 40/120 quad rating list and the 40/400 Marathon matches. But maybe I will use Quad machines later.
We have 68 plies in this game (category tactical) and you have to bear in mind, that many moves in the end were forced, leading to mate.
Statistically still insignificant, but just for fun here are the first stats in the tactical category:
Engines choosing the same move as Rybka 2.3.2a mp after three minutes:
Hiarcs 11.2 mp -- 46 identical moves
Zappa Mexico II mp -- 45 identical moves
Naum 3 mp -- 43 identical moves
Deep Shredder 11 -- 40 identical moves
When you check the main lines given, I think that so far none of those engines seems to have output similar to Rybka's. I will add Deep Fritz 11 to these tests as soon as it is out.
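For anyone who wants to repeat the count on their own analysis files, here is a small sketch of the comparison. The percentages assume that all 68 plies of the game were compared, which the post does not state explicitly, and the four-ply move lists at the end are purely hypothetical.

```python
def agreement(reference, candidate):
    """Count the positions where both engines picked the same move."""
    if len(reference) != len(candidate):
        raise ValueError("move lists must cover the same positions")
    hits = sum(1 for a, b in zip(reference, candidate) if a == b)
    return hits, hits / len(reference)

# Counts as posted above; 68 is the number of plies in the game (assumed denominator).
PLIES = 68
identical = {
    "Hiarcs 11.2 mp": 46,
    "Zappa Mexico II mp": 45,
    "Naum 3 mp": 43,
    "Deep Shredder 11": 40,
}
for engine, hits in identical.items():
    print(f"{engine}: {hits}/{PLIES} = {hits / PLIES:.1%}")

# Hypothetical four-ply example of the per-move comparison itself.
rybka_moves = ["e4", "Nf3", "Bb5", "O-O"]
other_moves = ["e4", "Nf3", "Bc4", "O-O"]
print(agreement(rybka_moves, other_moves))
```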
If you enjoy doing so, you could also check which engines the GMs play most like, once there are more games.
I understand what you say about the CCRL method not being completely correct; depth d is not the same as d-1 or d-2. But does this mean that they are not right and cannot be taken seriously?
For example, the highest correlation showed so far there was
Strelka 1.8 vs Rybka 1.0 beta x64 71.4% 0.23
I think they can be taken seriously like CEGT also.
Presentation of the stats is good, but when they are based on few games the stats might be misleading. CCRL is aware of this, but many readers are not. People will never stop drawing premature conclusions, but it is better to wait for many games.
Regarding real ponder-on stats, SSDF could give them, but they seem to have had a problem with many testers in the past years, so they are not very up to date.
About SSDF statistics, I cannot see them at the site, do you think that they will post them when they update the list with engines on quad results?
yes I understood and agree. People really might think that they run ponder-on games giving stats called ponder stats.
Regarding SSDF: this was my favorite many years ago, but I completely lost interest because too few important versions are tested there, so I do not even know exactly what is on their website or where we could find more details. When they announced the upgrade to Quad I thought it would become more interesting again. But now I see that they test Quad against very old hardware, which on the other hand is understandable in order to keep consistency with old results. It is really a pity that they do not have more resources and testers, as many people still like those ponder-on games.
in an attempt to clarify ;)
Currently all CCRL games are ponder off, however we take note of the expected move (the move that would have been pondered) when provided by the GUI.
This expected move is used to calculate the ponder hit stats we provide...
All the best
It's quite simple. Perhaps "ponder hit" is badly worded. What it means is "expected move hit". Maybe we should re-name it.
In the chessbase GUI for example, when Engine A makes a move, the GUI records in the pgn the move Engine A expects Engine B to respond with. Likewise, when Engine B makes a move, the GUI records in the pgn the move Engine B expects Engine A to respond with. The CCRL programming pulls this information out and presents it. The stats are thus completely valid and very meaningful. And the stats are gathered over potentially hundreds of games and thousands of moves (although games per engine pair are < 100). So they are good stats, very good, and careful thought has gone into them. Of course, there is always room for improvement
The website has this to say:
Here you can see statistics of expected moves, also called "ponder hit", in CCRL games. When two engines can predict most of the moves in their match, it means that they share similar understanding of chess, similar thinking. Ponder hit statistics shows how exactly similar they are. This data can be collected from simply a database of played games, so it is convenient way to find what engines are similar or different from each other.
It looks simple: just count the predicted moves and divide by the number of all moves. It is simple, but there are a few things to consider. First, there are opening moves, where engines don't think. We don't count such moves in this experiment. Second, there are forced moves, where there is no other choice. Such moves should not be counted either. We detect such moves by the time spent on them, so all moves made in 0:00 seconds are not used for this analysis.
Then, there are tablebase moves and mating lines. Such lines are characterized by many forced moves, but they also have many situations where it does not matter what to play. The result is that ponder hit statistics is not so meaningful in such lines. Ponder hit statistics is much more interesting in middlegame positions, where the move choice actually shows engine playing style and understanding. To limit this experiment to middlegame only we exclude all moves made with evaluation of +−9 pawns or more.
Finally, there are boring 50-move lines where engines don't know what to do, but still trying to avoid draw. In those lines engines play shuffle chess and any ponder hit analysis is meaningless. What's worse, just on the 50-th move they will move a pawn to avoid draw, and the shuffle chess continues for another 50 moves. Such cases are difficult to detect automatically, so after few experiments we decided to just ignore the drawn games completely. So, only decided games are used for correlation analysis in our study.
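The filtering rules quoted above can be sketched roughly as follows. This is only an illustration of the described procedure, not CCRL's actual code: the `Move` record, its field names and the sample games are hypothetical, and skipping moves that have no logged expected move is an added assumption.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Move:
    played: str               # move actually played
    expected: Optional[str]   # move the opponent expected, if the GUI logged one
    from_book: bool           # opening-book move, the engine did not think
    seconds: float            # thinking time recorded for the move
    eval_pawns: float         # engine evaluation in pawns

def ponder_hit_rate(games, eval_limit=9.0):
    """games: iterable of (result, [Move, ...]); result is '1-0', '0-1' or '1/2-1/2'."""
    hits = counted = 0
    for result, moves in games:
        if result == "1/2-1/2":              # drawn games are ignored completely
            continue
        for m in moves:
            if m.from_book or m.expected is None:
                continue                     # no thinking, or nothing to compare with
            if m.seconds < 1:                # shown as 0:00, treated as a forced move
                continue
            if abs(m.eval_pawns) >= eval_limit:
                continue                     # tablebase and mating lines
            counted += 1
            hits += m.played == m.expected
    return hits / counted if counted else None

# Hypothetical data: one decisive game and one draw.
sample_games = [
    ("1-0", [
        Move("e4", "e4", True, 0, 0.2),      # book move, skipped
        Move("Nf3", "Nf3", False, 0, 0.3),   # 0:00 move, skipped
        Move("Qh5", "Qh5", False, 30, 9.5),  # evaluation beyond the limit, skipped
        Move("Bb5", "Bb5", False, 45, 0.4),  # counted, hit
        Move("a3", "h3", False, 60, 0.1),    # counted, miss
    ]),
    ("1/2-1/2", [Move("d4", "d4", False, 20, 0.0)]),
]
print(ponder_hit_rate(sample_games))
```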
Could you be more explicit, please? A sort of safe playing and drawing attitude?
I am actually cutting off the tests early at 54 games so that I can free up my CPU for other stuff. The results are +2-4=12 against Rybka 2.3.2a, +2-2=14 against Fritz 11, and +5-4=9 against Shredder 11. This is all on 1 CPU with a 32-bit operating system at CEGT 40/20. If you have Rybka 2.3.2a and Zappa Mexico, there is absolutely no reason to buy Naum 3. While Naum is better at finding draws than Rybka, I think that Zappa Mexico will also find those draws, and Zappa Mexico is also a bit more tactical (if only slightly weaker) than Naum 3.
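For readers comparing these figures with the other results in the thread, here is a short sketch that converts the +W-L=D notation into score and draw percentages. The helper names are made up for illustration.

```python
import re

def parse_wld(text: str):
    """Parse '+W-L=D' (ASCII '-' or Unicode '−') into (wins, losses, draws)."""
    m = re.fullmatch(r"\+(\d+)\s*[-−]\s*(\d+)\s*=\s*(\d+)", text.strip())
    if m is None:
        raise ValueError(f"not a +W-L=D result: {text!r}")
    wins, losses, draws = map(int, m.groups())
    return wins, losses, draws

def summary(text: str):
    """Return (games, score fraction, draw fraction) for one result."""
    w, l, d = parse_wld(text)
    games = w + l + d
    return games, (w + 0.5 * d) / games, d / games

# Results as posted above, from Naum 3's point of view.
results = {"Rybka 2.3.2a": "+2-4=12", "Fritz 11": "+2-2=14", "Shredder 11": "+5-4=9"}
for opponent, text in results.items():
    games, score, draw_rate = summary(text)
    print(f"{opponent}: {games} games, score {score:.1%}, draws {draw_rate:.1%}")
```

The three matches add up to the 54 games mentioned, with two thirds or more drawn against Rybka and Fritz and half against Shredder.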
thanks for the results. When you compare them with what the CEGT 40/20 testers have got with Naum 3 x64 4CPU so far, you will probably confirm that, for "power users" at least, Naum 3 is a good option.
Regarding Kramnik you may be correct regarding playing style, but I doubt that Vladimir could get 8 draws against Naum under equal conditions and with no handicap for Naum, not even against the single 32-bit version. Maybe I underestimate human top players, but they are prone much more to tactical mistakes than top engines.
> Maybe I underestimate human top players, but they are prone much more to tactical mistakes than top engines.
But if neither Vladimir nor Naum do anything to try to win the game and they stay happy with a draw, game after game, then being able to draw these 8 games sounds likely.
but beating Fritz 11 4½-2½ and Toga II 1.4 beta 5c 4½-1½ is not a bad start, with zero losses! It must be doing something right... On CCC somebody who had said before he would not buy the engine, came back from this decision watching some of Naum's games and he compared the playing style to Rebel. An improved Rebel is not a bad program!
> On CCC somebody who had said before he would not buy the engine, came back from this decision watching some of Naum's games and he compared the playing style to Rebel. An improved Rebel is not a bad program!
It is only his opinion. As a Pro Deo fan by heart, I can say that Naum doesn't play like Pro Deo.
> But my point is that I think it is very hard to achieve these kind of results, especially good results against Rybka.
I don't care about their scores against Rybka, but about their play style. If Naum has a very drawish playing style, I'm not interested (Doesn't it have some kind of configurable contempt so by changing it Naum tries to avoid draws? It could be interesting.)
the percentage of draws seems to depend on the opponents. There are many against Rybka and Zappa, so you can also blame those two. As Uri explained, I guess this is because of the very high level of these matches between the top three engines. There are also many draws between Rybka and Zappa, by the way.
There are fewer draws against Deep Shredder 11 and so far none against Deep Fritz 10.1 after the first games, and when you check all the games you will find a lot of highly interesting tactical battles, even if they end in a draw. But you will see more tactics and devastating attacks between 2500 Elo engines, or in matches between engines of very different strength.
Here is the text of Naum's configuration file if you use Naum 3 as a Winboard engine:
# Winboard configuration file for Naum chess engine
# (needs to be in the same directory as naum.exe)
# This file is not used when the engine is in the UCI mode
# Set to 1 to enable pondering (thinking on opponents move)
PONDER = 0
# Set to 1 to enable book learning.
# Note that Naum saves learned info in the book file, so if
# you replace the book file, all the learned info will be lost.
# You should use a separate book file for blitz test tournaments,
# if you don't want less accurate blitz learned info to influence
# openings Naum plays under regular time controls.
LEARN = 1
# Set to 1 to allow engine to resign a game
RESIGN = 1
# If set to 1, engine will not clear the hash tables when position
# on board is changed. This is usefull for analysis, because
# the engine will keep hash entries when user goes backwards
# or forwards on the move list while the engine is in the analysis mode.
# When playing a game this option is ignored, because the engine
# will always keep the hash values.
# Also when the 'new game' command is issued, hash is always cleared.
PRESERVE_HASH = 0
# Transposition table size in megabytes (min 8MB, max 1GB, default 64MB)
TT_SIZE = 64
# Path to endgame tablebase files
# EGTB cache size in MB (min 1, max 128, default 32)
EGTB_CACHE_SIZE = 32
# Maximum number of threads (CPUs) to use (min 1, max 8, default 1)
MAX_THREADS = 1
# Use positive value to tell Naum the draw is bad, or negative value to indicate the draw is good
# Warning! Using contempt may reduce the playing strength, but it might be good against humans
DRAW_CONTEMPT_SCORE = 0
# Smallest depth in the search tree at which to probe EGTBs.
# Increase this value if the EGTB probing is slowing down the engine.
MIN_EGTB_DEPTH = 3
# Use a positive value to increase the importance of material in the position evaluation.
# Use a negative value to decrease the importance of material compared to the other
# positional factors and king safety.
MATERIAL_IMPORTANCE = 0
# Configurable material evaluation parameters.
# Parameter values are in centipawns and will be added to the default value.
MINOR_VS_PAWNS_SCORE = 0
ROOK_VS_PAWNS_SCORE = 0
ROOK_VS_MINOR_SCORE = 0
TWO_MINORS_VS_ROOK_SCORE = 0
THREE_MINORS_VS_QUEEN_SCORE = 0
TWO_ROOKS_VS_QUEEN_SCORE = 0
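If you edit this file by hand, a quick sanity check can catch typos before starting a long match. The sketch below reads `KEY = VALUE` lines and checks the three ranges stated in the file's own comments (taking 1GB as 1024 MB). It is a stand-alone helper written for illustration, not how Naum itself reads the file.

```python
def parse_config(text: str) -> dict:
    """Parse KEY = VALUE lines, ignoring blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY = VALUE line
        value = value.strip()
        try:
            settings[key.strip()] = int(value)
        except ValueError:
            settings[key.strip()] = value  # keep non-numeric values such as paths as text
    return settings

# Ranges taken from the comments in the file above.
LIMITS = {"TT_SIZE": (8, 1024), "EGTB_CACHE_SIZE": (1, 128), "MAX_THREADS": (1, 8)}

def out_of_range(settings: dict) -> list:
    """Return the names of settings that fall outside the documented range."""
    return [k for k, (lo, hi) in LIMITS.items()
            if k in settings and not lo <= settings[k] <= hi]

sample = """\
# Transposition table size in megabytes (min 8MB, max 1GB, default 64MB)
TT_SIZE = 64
EGTB_CACHE_SIZE = 32
MAX_THREADS = 1
DRAW_CONTEMPT_SCORE = 0
"""
settings = parse_config(sample)
print(settings, out_of_range(settings))
```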
Naum 3 SP 32-bit has lost just one rating point since yesterday in CCRL 40/40 but with 2981 (+44/-43) after 157 games it is still ahead of Naum 3 SP 64-bit with 2949 (+48/-48) elo. Not without losses anymore against Fritz 11 and Toga II 1.4 beta5c. Thanks to CCRL, Graham I think for the updated list, nice to watch the progress of a program this way if possible!
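A quick way to read those two ratings is to check whether their error intervals overlap. A minimal sketch using only the numbers quoted above:

```python
def interval(rating: int, plus: int, minus: int):
    """Return the (low, high) bounds of a rating given as 'rating (+plus/-minus)'."""
    return rating - minus, rating + plus

def overlaps(a, b) -> bool:
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

sp32 = interval(2981, 44, 43)   # Naum 3 SP 32-bit
sp64 = interval(2949, 48, 48)   # Naum 3 SP 64-bit
print(sp32, sp64, overlaps(sp32, sp64))
```

The intervals are 2938 to 3025 and 2901 to 2997, so they overlap and the 32-point gap between the two versions is well within the error bars at this number of games.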
It is logical to expect more draws when the level is higher, and there are often more draws in world championship matches between humans.
The first match between Kasparov and Karpov had 40 draws out of 48 games.
Considering that computers today are stronger than humans, I wonder what the reason was for the large number of draws between humans, even in world championship matches, where a draw with White is a bad result and so there is no reason for short GM draws.
Note that even Emanuel Lasker had 8 draws out of 10 games in one of his matches (against Carl Schlechter), and it is clear that Lasker had good reasons to try to win, because he lost game 5 of the match, but he could get only draws in games 6-9.
Zappa 222 draws
Shredder 204 draws
Naum 229 draws
Hiarcs 203 draws
Fritz 201 draws
Loop 199 draws
Glaurung 190 draws
Only 3 programs have less than 40%
Junior 169 draws
Rybka 168 draws
Bright 151 draws
Note that Naum 3 does not seem to produce more draws than Naum 2.2.
Notice that 19 out of 32 games of Naum 3 against Shredder 11 were not drawn.
[code]CEGT Quad tournament time control 2008
1 Naum 3 x64 4CPU 0½½1½10½½½½00½½010101½11½½1111½0 17.5/32
2 Deep Shredder 11 x64 4CPU 1½½0½01½½½½11½½101010½00½½0000½1 14.5/32[/code]
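To compare this match with the human matches cited above, the draw rate can be read straight off the result string. A minimal sketch; the match labels are just descriptive names.

```python
def draw_rate(draws: int, games: int) -> float:
    """Fraction of games that ended in a draw."""
    return draws / games

# Naum 3 against Deep Shredder 11, as posted above.
naum_vs_shredder = "0½½1½10½½½½00½½010101½11½½1111½0"
games = len(naum_vs_shredder)
draws = naum_vs_shredder.count("½")

matches = {
    "Naum 3 vs Deep Shredder 11": (draws, games),
    "Kasparov vs Karpov, first match": (40, 48),
    "Lasker vs Schlechter": (8, 10),
}
for name, (d, g) in matches.items():
    print(f"{name}: {d}/{g} drawn = {draw_rate(d, g):.1%}")
```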
> It seems to me that the ultra-selective engines, modeled after Rybka, are good at finding good moves and not playing bad moves, but don't do such a good job at finding the best move. If this is true, both avoiding bad moves and not finding the best move will contribute to a higher draw percentage.
I think you're right, and that this approach to the game also produces some boring games, whereas engines that don't follow Rybka play more lively chess. Best moves are usually aggressive, active, or at least unbalance the position; other engines play such moves more often, but they perform worse because they also play bad moves more often.
Test suites where the best move is a sacrifice are misleading; in most cases the best move is not a sacrifice.
In a competitive field where no one can consistently beat Rybka, perhaps the best way to gain Elo points is to consistently draw Rybka while Rybka consistently beats everyone else.
Naum does not consistently draw Rybka.
Naum 3 32-bit lost 20-10 against Rybka 32-bit in one CCRL match, while it beat Rybka 2.5-0.5 in another tournament by Graham Banks,
so the total result of Rybka 32-bit against Naum 3 32-bit is 22.5-12.5.
> The backward analysis reveals a better move a reasonably high percentage of the time.
What I do is match the Rybka analysis against the analysis of a more active engine (Like Toga.) The first time I was expecting that Rybka was going to convince the other engine that Rybka's moves were best, but I was surprised that this engine was convincing Rybka that her moves weren't best, and that there were better alternatives. Of course, going back some moves and asking the engine for another move kept the analysis going, and Rybka and the other engine were finding better alternatives to their previous moves. These alternatives were playable moves that put more pressure on the opponent.
Rybka was thinking that the position was a draw, but the moves from the other engine were raising Rybka's scores and Rybka had to accept that her moves weren't best. I found richness in the position that I couldn't have found if I used Rybka alone.
> Yes, I have found the same thing. Rybka's strength seems to be mainly associated with not making bad moves, as opposed to making best moves.
Playing the best move in a position is really hard. I doubt that Rybka with an hour on a quad plays the best move even let's say half the time. This isn't some sort of design decision - it's just that chess is hard.
I attribute the increased draw percentage in engine games to the general decrease in engine bad moves while most seem to attribute it to engines making best moves.
It's not really a matter of playing best moves, although the changed player will play more of those.
> I attribute the increased draw percentage in engine games to the general decrease in engine bad moves while most seem to attribute it to engines making best moves.
Ok - at the moment I don't think that there is much we can attribute to playing tons of best moves. Maybe in ten years.
1) We do not know if Rybka is best at the very slow time control that you use.
The slowest time control at which Rybka is tested is 40/400, and you use an even slower time control.
2) I do not know if you can trust Rybka's PV to contain the moves that Rybka is going to play later. Rybka may have a bug in finding PV moves, while in games she fortunately reaches a bigger depth on the next move, so she is not going to play them.
If Rybka has the PV 1.xx yy 2.zz at depth 28, it does not mean that Rybka is going to play 2.zz in a game at depth x<28, so if 2.zz is not the best move, it proves nothing if Rybka does not play this move.
3) I do not know if other engines are more often correct in finding the best move.
Note that I also felt that Rybka was stupid in some correspondence games that I played (I no longer play correspondence games), but the fact that other engines can help does not mean that they are better at finding the best move; it is possible that they are even worse in the number of times that they find the best move.
2) If the PV is wrong (which is not unusual far from the root), this should show up in the backward analysis. The new move is then followed to the same depth prior to starting backward analysis. This is not really critical though. The interesting thing is that backward analysis not infrequently finds a better move at the root. Rybka will stick with this "better" move at the same depths as the original move was calculated to, and with a better eval.
3) True. I have no hard evidence that Rybka is more prone to this behavior than other engines. I have only the impression that Rybka is better at finding bad moves and other engines may be better at finding the best move.
I do know that in a significant fraction of the CC games I play, Rybka will play into a lot of pawn up draws if I just follow her recommendations. Once again, I don't have hard evidence that Rybka is worse here than other engines.