Rybka Chess Community Forum
Up Topic Rybka Support & Discussion / Rybka Discussion / 10-game match RR update
- - By Dragon Mist (****) Date 2007-06-24 23:26
Here's an updated table for the round-robin tournament in which each engine plays every other engine in a 10-game match. PIII 1.7GHz, Win XP Pro, single processor, 40'/40+15'/20+5'/rest, 64 MB for hash, ponder off, balanced classical games book up to 8th move, Nalimov 3-4-5, all Rybkas' Nalimov usage Normal.

Rybka 2.3.2a got off to a terrible start, losing all 4 matches (3-7 to 2.2, 4-6 to 2.3 LK, 4.5-5.5 to both 2.1o and 2.3). "Small sample people" will probably rush in immediately, but one question boggles my mind: if we presume 2.3.2a is the best so far, what exactly is the probability of it losing four 10-game matches in a row to previous versions? I believe it is well below 10%. Maybe 2.3.2a will go up after having played weaker opponents, but somehow I doubt it.
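
For a rough sense of that probability, here is a minimal sketch. It assumes, purely for illustration, that 2.3.2a is exactly equal in strength to each opponent (25% wins, 25% losses, 50% draws per game, the draw rate being taken from 2.3.2a's row in the table) and that games are independent. The function name and the per-game probabilities are hypothetical choices, not anything measured in the tournament.

```python
from math import comb

def match_loss_probability(games: int = 10,
                           p_win: float = 0.25,
                           p_loss: float = 0.25) -> float:
    """Probability of scoring below 50% in a match of independent games.

    The per-game probabilities are illustrative assumptions.
    """
    p_draw = 1.0 - p_win - p_loss
    total = 0.0
    # Sum the trinomial probabilities of every outcome with more losses than wins.
    for wins in range(games + 1):
        for losses in range(wins + 1, games - wins + 1):
            draws = games - wins - losses
            ways = comb(games, wins) * comb(games - wins, losses)
            total += ways * p_win**wins * p_loss**losses * p_draw**draws
    return total

p_lose = match_loss_probability()
p_four_in_a_row = p_lose ** 4  # four independent matches, all lost
print(f"lose one match: {p_lose:.3f}, lose four in a row: {p_four_in_a_row:.3f}")
```

Under these assumptions the chance of losing all four matches comes out at roughly 3%, and it would be lower still if 2.3.2a were genuinely the stronger engine.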

Rybka 1.1 and Ktulu 8.0 have a couple of matches left to complete the tournament.

Anyone interested in the games, please drop me a note.

Dragon Mist

    Program                          Elo    +   -   Games   Score   Av.Op.  Draws

  1 Rybka 2.2 32-bit               : 2996   43  42   160    64.7 %   2891   39.4 %
  2 Rybka 2.3 LK 32-bit            : 2985   42  42   160    63.1 %   2892   41.2 %
  3 Rybka 2.1o 32-bit              : 2968   42  42   160    60.6 %   2893   41.2 %
  4 Rybka 2.3 32-bit               : 2965   40  40   160    60.3 %   2893   45.6 %
  5 Rybka 1.2f 32-bit              : 2962   41  41   150    60.0 %   2892   46.7 %
  6 Rybka 2.2n2 32-bit             : 2962   43  43   150    60.0 %   2892   41.3 %
  7 Rybka 1.1 32-bit               : 2948   46  46   130    55.0 %   2913   42.3 %
  8 Rybka 2.3.1 32-bit             : 2931   43  43   150    55.3 %   2894   41.3 %
  9 HIARCS 11.1 UCI                : 2918   47  47   150    53.3 %   2895   30.7 %
 10 Rybka 2.3.2a 32-bit            : 2911   76  78    40    40.0 %   2981   50.0 %
 11 LoopMP 12.32                   : 2879   42  42   150    47.3 %   2897   44.0 %
 12 Shredder 10 UCI                : 2879   47  48   150    47.3 %   2897   28.0 %
 13 Toga II 1.3x4                  : 2859   46  46   150    44.3 %   2899   32.7 %
 14 Junior 10.1                    : 2834   47  48   140    38.6 %   2915   34.3 %
 15 Chess Tiger 2007               : 2801   51  51   130    34.2 %   2915   31.5 %
 16 Deep Sjeng 2.5                 : 2719   59  61   130    23.8 %   2921   20.0 %
 17 Ktulu 8.0                      : 2695   64  66   120    19.6 %   2940   20.8 %


Games        :   1190 (finished)

White Wins   :    437 (36.7 %)
Black Wins   :    313 (26.3 %)
Draws        :    440 (37.0 %)
Unfinished   :      0

White Perf.  : 55.2 %
Black Perf.  : 44.8 %

ECO A =    138 Games (11.6 %)
ECO B =    349 Games (29.3 %)
ECO C =    216 Games (18.2 %)
ECO D =    269 Games (22.6 %)
ECO E =    218 Games (18.3 %)
Parent - - By Roland Rösler (****) Date 2007-06-25 00:33
It's dragon "mist"; in German, bullshit!!
Parent - By RFK (Gold) Date 2007-06-25 00:48 Edited 2007-06-25 00:52
Roland,

You have such an eloquent way of putting things! You are direct and leave no misunderstandings!

Robert
Parent - - By turbojuice1122 (Gold) Date 2007-06-25 00:50
Well, your stats give you the probability immediately: there is roughly a 16% chance that this is "completely" statistics and nothing else going on.  However, I also question the idea of that particular time control you're using--it's a very strange time control, and some engines might react to it much better than others--you're playing almost "classical" chess and then all of a sudden fast blitz at a point where the game is on the threshold of being decided.  I think you'll have quite a lot of random stuff kicking in there.  Of course, you can also compare your results against "known theory"--thousands of games have been played, and 2.3.2a is definitely not worse than 2.3.1 or 2.2, though some lists might suggest that it's nearly equal.  You also have a very unusual situation in that Rybka 1.1 is above Rybka 2.3.1, while in all rating lists, version 2.3.1 is VERY much ahead of version 1.1.  I would say that there is something wrong with your choice of time control--for some reason, the 2.3 series doesn't like it very much.
Parent - - By Dragon Mist (****) Date 2007-06-25 08:39
Thanks, turbojuice, for your kind answer and your thoughts, something that some other posters are obviously unable to manage.
This situation is very unusual, and the reason I posted it is that I'm trying hard to find a logical explanation and was hoping I had missed something.
- time control: I believe it is a decent time control, giving an average of 60 secs per move for the first 40 moves (it is actually 75 secs, as 99% of the games start from move 9 due to the book I'm using). This should give engines enough time to play the key part of the game decently. Most human games would have been decided by this time. Engines, however, tend to decide the outcome later, so I gave an average of 45 secs per move for the next 20 moves, which is still very good. The blitz finish is needed to keep the games within a reasonable finishing time, although I might have added an increment of 1 sec per move.
- book: it is derived from the CB standard GM database and limited to 16 ply. Furthermore, engines are not allowed to play rare lines, so it all results in a very classical and still wide range of openings, which shouldn't favour anyone and should keep opening wins to a minimum; this I can confirm, having watched hundreds of the games.
- single processor: it might be that newer Rybka versions are better at using more than one processor, although Vas denies that;
- instant reply: more recent Rybka versions all use this "feature", and from what I've seen, it quite often results in non-optimal play - I believe this is one of the main reasons for the relative inferiority of the 2.3 versions; it does not, however, explain why 2.3.1 and 2.3.2a are performing so badly;
- system: maybe on systems as slow as the one I am using there is just not enough depth for newer Rybkas to display their superiority; it is a long shot, but for the moment I rely heavily on this theory, as nothing else seems to make sense; do note, however, that the other ratings (non-Rybka, and most Rybka) are quite consistent with many rating lists around;
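
The per-move averages quoted for the time control can be checked with a quick sketch; the helper name is just an illustration, and the only inputs are the 40'/40 and 15'/20 stages and the 8-move book mentioned in this thread.

```python
def seconds_per_move(minutes: float, moves: int) -> float:
    """Average thinking time per move for one stage of the time control."""
    return minutes * 60 / moves

book_moves = 8  # games leave the book at move 9, so 8 moves cost no clock time

first_stage_nominal = seconds_per_move(40, 40)                # 60.0
first_stage_actual = seconds_per_move(40, 40 - book_moves)    # 75.0
second_stage = seconds_per_move(15, 20)                       # 45.0

print(first_stage_nominal, first_stage_actual, second_stage)
```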

Dragon Mist

P.S: I might try to run the entirely blitz tournament, 5'+1"?
Parent - - By turbojuice1122 (Gold) Date 2007-06-25 12:07
Starting "at the end" first, yes, you would probably achieve more representative results if your tournament was "all blitz", such as the 5'+1" time control that you are suggesting.  I don't really think it's a problem with your system, as one could equate it with newer systems simply by dividing your time control by 3 or 4 or whatever.  I don't think it's so much the amount of time that is spent thinking--after all, Rybka is by far the best engine at blitz.  I think your earlier time control somehow "fools" Rybka for one reason or another--at least, that's the theory I currently have.  This leads immediately to your previous idea about the "instant reply".  Normally this works well, as it only happens when the reply is an "only move".  Perhaps there is something going on with either the change in time control or your particular system that is "fooling" Rybka into making "only moves" in situations where it shouldn't--you could probably check that by looking at the games and letting it evaluate for longer in those situations where it made "only moves".

I don't think it's an issue with single processor--I have a single processor, and version 2.3.2a is, under the same conditions, far better against opposition than version 2.3.1 or any of the previous versions (version 2.3.1 was slightly better than versions 1.2f and 1.2n, which were in turn slightly better than version 2.2 and far better than version 1.1; this agrees pretty much with the various rating lists).

As for the book, it's possible, but not likely, that it somehow ends up with positions that work better for one Rybka than another.  You could always test this by doing the same tourney but with a different book, such as the RybkaII.ctg book or the free sheebar.ctg book or something.  You mention that engines are not allowed to play rare lines--I wonder if this leads to some repetition in the games?  You can always check the ECO codes on the game list afterward.
Parent - - By Dragon Mist (****) Date 2007-06-26 08:48
Hm, hm, hm.

My reasoning for this time control is that, with such an outdated, slow system, it is roughly equal to blitz on some newer systems (a fast PIV, say).
However, the pure "logic" that blitz yields more realistic ratings than a slow time control on the same hardware isn't clear to me.
As for the combination of time control and instant reply, this would suggest that the 2.3 series is good only if there are no changes in the average time per move throughout the game?! (40'/40+40'/40+20'/20 etc?)
I didn't really have time to check all the instant reply move "blunders", but saw many examples like this: 2.3XY (whatever) plays 2.2. The evaluation rises steadily for 30 moves or so in 2.3's favour, from +0.30 to +1.70. Now we're in a relatively simple endgame, with decent time on the clock for both. Suddenly 2.3 plays an instant move. Next move, the evaluation (which was more or less similar for both sides) goes back to, say, +0.90. Ok, this is still winnable, but ...
If I use RybkaII.ctg, this will clearly favour Rybkas over non-Rybka engines, as I believe this book is well suited to Rybka's style of play.

Dragon Mist
Parent - - By turbojuice1122 (Gold) Date 2007-06-26 11:44
I also believe that a longer time control yields more realistic ratings...for longer time control games.  However, blitz time control yields more realistic ratings for blitz time control games.  I would think it best that no matter what you do, be consistent: if you want to test longer time control, then keep it longer time control throughout the game, such as 40'/40 repeated, or 40'/40+20'/20+20'/rest.  The instant move bug in 2.3 was cured at least as of 2.3.2, and it only moves instantly when it's an "only move".

As for the book, I just gave an example.  There is actually quite a bit of debate about whether the "own books" are really best suited for that particular engine's style of play.  You could use more of a generic book (but still one that is good and quite representative of the games you'll encounter in engine play) such as sheebar.ctg or tourbookII.ctg, both of which are freely available and are probably nearly as good as the RybkaII.ctg book.
Parent - By Dragon Mist (****) Date 2007-06-27 07:19
You've convinced me. As soon as I complete 2.3.2a, 1.1 and Ktulu 8.0 (in a matter of days), I'll start a blitz tourney with a time increment, using sheebar, but will limit book usage to a maximum of 8 (or 10?) moves.

Dragon Mist
Parent - - By Lukas Cimiotti (Bronze) Date 2007-06-26 19:22
I am currently running an engine match, Rybka 2.2 - Rybka 2.3.2a, on my octa-core machine. According to this website http://www.jens.tauchclub-krems.at/diverses/Schach/fritz9_benchmarks.html my computer should be roughly 15 times faster than yours, plus an additional 60% for using 64 bit - that makes it around 24 times faster for Rybka. My match is running at 5+1, so the search depth should be greater in my games. So far, 243 games give a result of +27 -54 =162 (44.4%) for Rybka 2.2, so I think 2.3.2a is significantly better.

Btw. is that really a PIII overclocked to 1.7 GHz?
Parent - - By turbojuice1122 (Gold) Date 2007-06-26 21:19
Yes, using the "random walk" analysis, since you have 81 decisions, you have a standard error of Sqrt(81) = 9 (how convenient--thanks for having a perfect square number of decisions :-) ), and one would need to move 1.5 standard deviations to equality, so there is only a 6.7% chance that your results are due to randomness in statistics alone.
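
The 6.7% figure matches the one-sided normal tail beyond 1.5 standard deviations. Here is a minimal sketch of that lookup, using only the win and loss counts from the match above; the function and variable names are illustrative. How many standard deviations the result represents depends on the step size assumed: the second calculation assumes each decisive game is a +1 or -1 step in the win-minus-loss difference, which is an assumption of this sketch rather than something stated in the thread.

```python
from math import erfc, sqrt

def upper_tail(z: float) -> float:
    """One-sided probability that a standard normal variable exceeds z."""
    return 0.5 * erfc(z / sqrt(2))

wins_232a, wins_22 = 54, 27          # decisive games from the 243-game match
decisive = wins_232a + wins_22       # 81
std_error = sqrt(decisive)           # 9.0, as in the random-walk argument

# Tail probability for the 1.5 standard deviations quoted above.
p_quoted = upper_tail(1.5)

# Assumption: each decisive game moves the win-minus-loss difference by 1.
z_unit_steps = (wins_232a - wins_22) / std_error   # 27 / 9 = 3.0
p_unit_steps = upper_tail(z_unit_steps)

print(f"tail at 1.5 sd: {p_quoted:.4f}")
print(f"z with unit steps: {z_unit_steps:.1f}, tail: {p_unit_steps:.4f}")
```

With unit steps the 27-game gap is 3 standard deviations rather than 1.5, which would make pure chance an even less likely explanation.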

However...many people are getting far different results when they play their respective Rybkas against other engines and then compare the results--much closer to being equal.
Parent - By Lukas Cimiotti (Bronze) Date 2007-06-26 21:41
Yes - you got right to the point :)
The number of games is increasing. Now there are 260 games: +29 =172 -59, 44.2%.
And the chance of being wrong goes down.
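
For anyone wanting to check the percentages, a small sketch of the usual scoring convention (1 point per win, half a point per draw); the function name is illustrative and the inputs are the two results quoted for Rybka 2.2 in this thread.

```python
def score_percent(wins: int, draws: int, losses: int) -> float:
    """Match score as a percentage: win = 1 point, draw = 0.5 points."""
    games = wins + draws + losses
    return 100 * (wins + 0.5 * draws) / games

after_243 = score_percent(27, 162, 54)   # Rybka 2.2 after 243 games
after_260 = score_percent(29, 172, 59)   # Rybka 2.2 after 260 games

print(f"{after_243:.1f}% after 243 games, {after_260:.1f}% after 260 games")
```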
Parent - By Dragon Mist (****) Date 2007-06-27 07:20
You're right, it is actually Celeron @1.7. It is 70% of the speed of PIII @1.2, and roughly 30% of the speed of PIV @2.60 using single core.

Dragon Mist
Parent - - By CumnorChessClub (***) Date 2007-06-26 22:33
@Dragon Mist
Is the statistics table you posted produced manually, or is there a program that generates it?
Parent - - By Dragon Mist (****) Date 2007-06-27 07:21
The rating calculation is produced by ELOSTAT 1.3, a well-known and freely available program made for this exact purpose.

Dragon Mist
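
ELOSTAT's exact method isn't described in this thread, but the Elo column in the table above is consistent with the standard logistic Elo formula applied to each engine's score and average opponent rating. A minimal sketch under that assumption (the function name is hypothetical, and this is not ELOSTAT's actual code):

```python
from math import log10

def performance_rating(avg_opponent: float, score_fraction: float) -> float:
    """Rating implied by a score against opponents of a given average rating.

    Assumes the standard logistic Elo curve; score_fraction must be
    strictly between 0 and 1.
    """
    return avg_opponent + 400 * log10(score_fraction / (1 - score_fraction))

# Rows taken from the table: (average opponent, score fraction)
rybka_232a = performance_rating(2981, 0.400)
rybka_22 = performance_rating(2891, 0.647)

print(round(rybka_232a), round(rybka_22))
```

For these two rows the sketch gives 2911 and 2996, matching the table.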
Parent - By CumnorChessClub (***) Date 2007-06-27 17:26
Thanks Dragon Mist