Rybka Chess Community Forum
Topic: Rybka's 3 ELO in CCRL after some amazing results....
- - By George Tsavdaris (****) Date 2008-07-14 16:51
.....that I imagined. :-) The title is misleading, I know :-P

Imagine that Rybka 3 had the following EXTRA AMAZING results in CCRL.
Obviously Rybka 3 will not get results this good, not even close, but I just wanted to see what ELO an almost GOD engine would have.

Rybka 3 64-bit 4CPU - Naum 3.1 64-bit 4CPU        -----> +42  =8  -0  ||Rybka 2.3.2a ----> +18 =28 -4
Rybka 3 64-bit 4CPU - Zappa Mexico II 64-bit 4CPU -----> +36  =7  -0  ||Rybka 2.3.2a ----> +17 =20 -6
Rybka 3 64-bit 4CPU - Naum 3 64-bit 4CPU          -----> +25  =7  -0  ||Rybka 2.3.2a ----> +7  =23 -2
Rybka 3 64-bit 4CPU - Zappa Mexico 64-bit 4CPU    -----> +100 =20 -0  ||Rybka 2.3.2a ----> +48 =86 -25
Rybka 3 64-bit 4CPU - Toga II 1.4.1SE 4CPU        -----> +29  =6  -0  ||Rybka 2.3.2a ----> +19 =12 -5
Rybka 3 64-bit 4CPU - Hiarcs 12 4CPU              -----> +45  =10 -0  ||Rybka 2.3.2a ----> +26 =28 -1
Rybka 3 64-bit 4CPU - Hiarcs Paderborn 2007 4CPU  -----> +30  =8  -0  ||Rybka 2.3.2a ----> +19 =16 -3
Rybka 3 64-bit 4CPU - Deep Fritz 10.1 4CPU        -----> +28  =5  -0  ||Rybka 2.3.2a ----> +19 =12 -2
Rybka 3 64-bit 4CPU - Glaurung 2.1 64-bit 4CPU    -----> +45  =7  -0  ||Rybka 2.3.2a ----> +29 =20 -3
Rybka 3 64-bit 4CPU - Toga II 1.3.1               -----> +29  =3  -0  ||Rybka 2.3.2a ----> +22 =10 -0
Rybka 3 64-bit 4CPU - Deep Sjeng 2.7 4CPU         -----> +36  =4  -0  ||Rybka 2.3.2a ----> +27 =11 -2
Rybka 3 64-bit 4CPU - Fruit 051103                -----> +39  =5  -0  ||Rybka 2.3.2a ----> +2  =1  -0
Rybka 3 64-bit 4CPU - Glaurung 1.2.1 64-bit 4CPU  -----> +49  =4  -0  ||Rybka 2.3.2a ----> +40 =13 -0
Rybka 3 64-bit 4CPU - Chess Tiger 2007.1          -----> +34  =3  -0  ||Rybka 2.3.2a ----> +3  =0  -0

WOW! Incredible results someone would say. 3800 ELO? 3900 ELO? Well not quite....:

Here is the ELO calculated with the "Bayeselo" program that CCRL uses (Elostat gives only 7 points more for Rybka 3). I have calibrated the results to the 3130 ELO that Rybka 2.3.2a 64-bit 4CPU has on CCRL:

Rank      Name                             ELO    +    -   Games  Score
1    Rybka 3 64-bit 4CPU                    3387  36   33   664   93%  
2    Rybka 2.3.2a 64-bit 4CPU               3130  15   15  1500   72%  
3    Rybka 2.2 64-bit 4CPU                  3112  30   29   403   73%  
4    Rybka 2.3.2a 64-bit 2CPU               3091  17   17  1181   73%  
5    Rybka 2.1 64-bit 4CPU                  3086  39   37   248   73%  
6    Rybka 2.1 64-bit 2CPU                  3077  26   25   571   73%  
7    Naum 3.1 64-bit 4CPU                   3074  23   23   607   57%  
8    Zappa Mexico II 64-bit 4CPU            3073  20   20   821   61%  
9    Rybka 2.3.2a 64-bit                    3069  12   12  2376   72%  
10   Naum 3 64-bit 4CPU                     3068  19   19   929   60% 
11   Zappa Mexico 64-bit 4CPU               3065  16   16  1358   58%


One would be disappointed. "Only" 3387?
The point of all this is to show that if your opponents are not that strong, your ELO can't climb too far above that of a standard top player, due to the drawish nature of Chess as we play it (which comes from our inability to play it perfectly, although, if Chess is a draw, imperfect play should have the opposite effect).
Remember that ELO just measures the performance of a Chess player based on the results of the games he played against some other players. If the other players are weak, his ELO can't become too high (although Rybka's results in the rating lists don't confirm that).

So people who expect a super engine with 4000 ELO (always relative to the 3130 ELO of Rybka 2.3.2a 64-bit 4CPU) will have to wait until Rybka's opponents become much stronger.

And what if we break CCRL rules and match Rybka 3 with Rybka 2.3.2a? Then we would have:
Rybka 3 64-bit 4CPU - Rybka 2.3.2a 64-bit 4CPU ---> +16 =20 -4 (+108 ELO, a result that agrees with Larry's results)

And then the ratings would be:
Rank     Name                                Elo    +    -   games score
1     Rybka 3 64-bit 4CPU                    3363  33   31   704   91% 
2     Rybka 2.3.2a 64-bit 4CPU               3130   15   15  1540   71% 
3     Rybka 2.2 64-bit 4CPU                  3109   30   29   403   73% 
4     Rybka 2.3.2a 64-bit 2CPU               3087   17   17  1181   73% 
5     Rybka 2.1 64-bit 4CPU                  3082   39   37   248   73% 
6     Rybka 2.1 64-bit 2CPU                  3073   26   25   571   73% 
7     Zappa Mexico II 64-bit 4CPU            3069   20   20   821   61% 
8     Naum 3.1 64-bit 4CPU                   3069   23   23   607   57% 
9     Rybka 2.3.2a 64-bit                    3065   12   12  2376   72% 
10    Naum 3 64-bit 4CPU                     3064   19   19   929   60% 
11    Zappa Mexico 64-bit 4CPU               3061   16   16  1358   58% 


(PS: I want to thank Kirill Kryukov, who helped me learn how CCRL calculates its ratings.)
Parent - - By Jeroen (****) [nl] Date 2008-07-14 17:19
Interesting! Can you recalculate the rating if all opponents score only one draw each?
Parent - - By George Tsavdaris (****) Date 2008-07-14 17:51
Yes, thanks for the reminder. I wanted to do it but I forgot.
So here it is (again calibrated to the 3130 ELO of Rybka 2.3.2a 64-bit 4CPU):
(As a reminder: against each of the 15 programs I mentioned earlier, I gave Rybka 3 all wins except one draw with Black.)

Rank       Name                             Elo    +    -   games score
1    Rybka 3 64-bit 4CPU                    3706   88   70   704   99%
2    Rybka 2.3.2a 64-bit 4CPU               3130   15   15  1540   70%
3    Rybka 2.2 64-bit 4CPU                  3112   30   29   403   73%
4    Rybka 2.3.2a 64-bit 2CPU               3091   17   17  1181   73%
5    Rybka 2.1 64-bit 4CPU                  3086   39   37   248   73%
6    Rybka 2.1 64-bit 2CPU                  3076   26   25   571   73%
7    Naum 3.1 64-bit 4CPU                   3075   24   24   607   57%
8    Zappa Mexico II 64-bit 4CPU            3074   20   20   821   60%
9    Rybka 2.3.2a 64-bit                    3069   12   12  2376   72%
10   Naum 3 64-bit 4CPU                     3067   19   19   929   60%
11   Zappa Mexico 64-bit 4CPU               3066   16   16  1358   57%
Parent - By roblin (**) [se] Date 2008-07-14 18:17
head asplode
Parent - By turbojuice1122 (Gold) [us] Date 2008-07-14 22:02
I think it would be amusing to distribute this list all around the internet.  Perhaps if I make enough links to it in various posts, it will score high on a Google search. :-)
Parent - By Roland Rösler (****) [de] Date 2008-07-15 00:49
It's amazing to see how the confidence interval increases when you have a score of 99%. And best of all, the + bound is larger than the - bound (88 vs. 70).
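
The asymmetry can be illustrated with the plain logistic Elo curve, which gets very steep as the score approaches 100%. This is only a sketch: the function names and the ±0.5% score band are assumptions chosen for illustration, and Bayeselo's own interval calculation is not reproduced here.

```python
import math

def elo_from_score(score: float) -> float:
    """Elo difference implied by an expected score under the logistic model."""
    return 400.0 * math.log10(score / (1.0 - score))

def elo_band(score: float, half_width: float) -> tuple[float, float]:
    """Return (plus, minus) Elo margins for score +/- half_width."""
    center = elo_from_score(score)
    plus = elo_from_score(score + half_width) - center
    minus = center - elo_from_score(score - half_width)
    return plus, minus

# Same score uncertainty (half a percentage point), two very different scores.
for s in (0.60, 0.99):
    plus, minus = elo_band(s, 0.005)
    print(f"score {s:.0%}: +{plus:.1f} / -{minus:.1f} Elo")
```

At 60% the two margins come out nearly equal and small; at 99% the same score band is far wider in Elo terms, and the upper margin is clearly the larger one.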
Parent - By Vempele (Silver) [fi] Date 2008-07-14 19:07

> And what if we break CCRL rules and match Rybka 3 with Rybka 2.3.2a?


And what rule would that break, exactly? Don't play imaginary matches between versions of the same engine? There's plenty of hot Rybka incest on CCRL, you know... :-p
Parent - - By lkaufman (*****) Date 2008-07-14 22:52
This inspired me to run a bullet match of Rybka 3 beta quad against Crafty 20.14 (with short fixed openings and alternating sides), which was rated 2629 CCRL (standard) and 2605 CEGT (blitz), with contempt set to 50 to try to avoid draws. I wondered how this would affect the rating. After 24 games (small sample, just to illustrate; I'll let it run) the score is 23 wins for Rybka 3 and one draw. This is reported as +668 Elo. The predicted CCRL rating would thus be 3297 and CEGT would be 3270, both about fifty higher than I actually expect to get. So based on this small sample, it is not any harder to get a big rating against relatively weak opponents. To put it another way, some day we should have a program that beats the engines you cite as badly as Rybka 3 beats Crafty; after all Crafty is not that bad. Then we can get a rating like 3700.  A 98% result against today's top non-Rybka programs sounds impossible, but it will happen in the coming decade (assuming only short opening books like CCRL and CEGT use). 
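
As a rough check on the arithmetic, the Elo difference implied by a match result can be computed from the standard logistic Elo formula. This is a minimal sketch; the function name `elo_diff` is made up for illustration, and the tool used to report the figure above is not specified in the post.

```python
import math

def elo_diff(wins: int, draws: int, losses: int) -> float:
    """Elo difference implied by a match score under the logistic Elo model."""
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    if score <= 0.0 or score >= 1.0:
        # A 0% or 100% score has no finite Elo difference.
        raise ValueError("score must be strictly between 0 and 1")
    return 400.0 * math.log10(score / (1.0 - score))

# 23 wins and one draw in 24 games, as in the bullet match above.
diff = elo_diff(wins=23, draws=1, losses=0)
print(f"{diff:.1f}")
```

The result lands within a point of the +668 quoted above; adding it to the opponent's list rating gives the predicted rating on that list.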
Parent - - By Uri Blass (****) [il] Date 2008-07-15 00:27
I am not sure what is going to happen.
The main problem is that you may need to play against stronger players and it may be harder or even impossible to achieve 90% against 2010's top non rybka engines and you may need 90% in order to get 3700.

Uri
Parent - - By lkaufman (*****) Date 2008-07-15 01:58
You could be right, but at least I see no evidence that gross mismatches produce lower ratings than reasonable ones, at least if adjusting contempt is allowed. The test stopped after 141 games, with 135 wins for Rybka 3 beta quad, 2 wins for Crafty 20.14, and four draws. This 97.2% score translated to +613 Elo, which when added to the Crafty ratings gives us 3242 CCRL and 3218 CEGT, both remarkably close to my estimates for Rybka 3 quad. For another datapoint, I ran Rybka 3 beta octal against Hiarcs 10, a much more formidable opponent but still a huge mismatch. Hiarcs 10 has blitz ratings of 2833 CCRL and 2776 CEGT. Contempt set to 25. As of now, the result is 96 wins for Rybka, 17 draws, and no losses (!). This is 92.4% for +435 Elo. Adding this to the ratings gives 3278 for CCRL and 3211 CEGT, again very close to what I would expect Rybka 3 Octal to get on those lists if they decide to rate octals. So it seems to me that the Elo rating system works astoundingly well in this environment, and that the rating a computer ends up with has little to do with how close the pairings are. This disagrees with what I had thought previously; unlike some politicians I have no problem changing my opinion in the face of new evidence.
Parent - - By Roland Rösler (****) [de] Date 2008-07-15 03:03
Why not run a match with Rybka 3 sp against opponents on 8 (or 4) cores (whatever you have: Zappa, Naum, Shredder, Fritz, Toga, etc.)? It's a hard race, but Rybka 3 has to win them all (because of the predicted 3130 Elo for Rybka 3 sp). Here you could also see how a better eval fares against a much deeper search. You don't have to publish the results; they would be only for your own estimate.
Parent - - By lkaufman (*****) Date 2008-07-15 03:57
Yes, I think I'll try this at least with Deep Shredder 11 and Deep Fritz 10.1. Right now I'm learning a lot about our contempt factor, which will play a major role in Rybka now. 
Parent - - By George Tsavdaris (****) Date 2008-07-15 08:50
I wonder whether you should ask the CEGT and CCRL lists to test Rybka 3 with a contempt factor.
Perhaps they could test both contempt=0 and contempt set to a value you propose, and see which does better. That would be interesting....
Parent - - By lkaufman (*****) Date 2008-07-15 13:28
I'm already quite convinced that Rybka 3 will get a better rating with contempt than without. Some setting will be the default. Of course if testers want to test Rybka 3 without contempt they can do so, but the result should be clearly indicated as a non-approved setting (something like "contempt reduced to zero").
Parent - - By Uri Blass (****) [il] Date 2008-07-15 13:34
I wonder if it is not dependent on the opponents.

It is possible that in 2012, when there may be some programs stronger than Rybka 3, we will find that contempt=0 gets a higher rating than the default setting.

Uri
Parent - By lkaufman (*****) Date 2008-07-15 13:38
Yes, of course that is true, unless we make contempt dependent on input rating of opponent. But by then we won't care about Rybka 3!
Parent - - By lkaufman (*****) Date 2008-07-15 04:18
Update: First, I ran the same test with Fritz 5.32 substituting for Crafty. Fritz 5.32 is rated slightly higher (2628 CEGT blitz vs. 2605 for Crafty). As of now, the score is 87 to 1 with no draws (!) for +775 Elo=3403 CEGT!! I'm also rerunning the Hiarcs match, which finished at 121-0 with 19 draws, but this time with contempt raised from 25 to 40. This seems to make a real difference; now the score is 62-1 with 3 draws, for +561 Elo=3394 CCRL, 3337 CEGT! This indicates that contempt works perfectly; the increase cut the draw percentage to a third of what it was, but Rybka actually lost a game this time! The huge ratings we are getting from these mismatches seem to show conclusively that it is not harmful to play weak opponents, and also that contempt now plays a big role in ratings. We won't actually get ratings like these numbers, because we have to choose a default contempt which will be based on our strongest likely opponents. Ideally, we could have a place to enter the opponent's rating, and the program would calculate the optimum contempt factor based on that, but I don't know if that would be allowed by the testing organizations, and there is also the problem that they have different scales (CCRL ratings run about 50 higher than CEGT).
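
The change in draw rate between the two Hiarcs runs can be checked directly from the scores quoted above. A small sketch; the helper name `draw_rate` is an assumption for illustration.

```python
def draw_rate(wins: int, draws: int, losses: int) -> float:
    """Fraction of games in a match that ended in a draw."""
    return draws / (wins + draws + losses)

# Hiarcs match scores quoted in the post (Rybka wins, draws, Rybka losses).
contempt_25 = draw_rate(wins=121, draws=19, losses=0)  # 19 draws in 140 games
contempt_40 = draw_rate(wins=62, draws=3, losses=1)    # 3 draws in 66 games
ratio = contempt_40 / contempt_25

print(f"contempt 25: {contempt_25:.1%} draws")
print(f"contempt 40: {contempt_40:.1%} draws")
print(f"ratio: {ratio:.2f}")
```

The ratio comes out very close to one third, though with only 66 games in the second run the estimate is still noisy.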
Parent - - By Vinvin (***) [be] Date 2008-07-15 08:50

> because we have to choose a default contempt which will be based on our strongest likely opponents.


Note that it changes over the months: you could set a high default contempt now, but it should be lower (because of stronger opponents) in a year, and keep decreasing every month ...

> Ideally, we could have a place to enter the opponent's rating, and the program would calculate the optimum contempt factor based on that, but I don't know if that would be allowed by the testing organizations, and there is also the problem that they have different scales (CCRL ratings run about 50 higher than CEGT).


In high-level chess, it's very, very common to know the strength of your opponent, so why not in computer chess?
Parent - - By lkaufman (*****) Date 2008-07-15 13:25
I am asking the testing organizations about this now. There are some practical problems though.
Parent - - By Vinvin (***) [be] Date 2008-07-15 17:15
Maybe they could separate the 2 ratings (one with the contempt factor set and one without).
Parent - By lkaufman (*****) Date 2008-07-15 17:44
Definitely they should be separated if they choose to test both ways; if not I would expect them to test the default, which will probably be contempt=15.
Parent - - By JohnL (***) Date 2008-07-15 09:01
This is very interesting and impressive!

The problem of losing ELO points when playing weaker opponents is well known. And it seems to be a fact; Jeff Sonas claims that it is due to a flaw in the definition of the ELO formula.
Actually, the way FIDE calculates ELO is pretty archaic and dates from the pre-computer era.

What contempt should you use for playing 2200 players in blitz, 300? :-)
Parent - - By lkaufman (*****) Date 2008-07-15 13:34
I think that the issue Sonas writes about does not apply (or is negligible) in the CCRL/CEGT environment, where the players do not change in strength and the number of games between them is huge. Regarding contempt, currently it looks to me like the best value can be approximated by rating difference/15, so if we assume that a 2200 player is about 1200 below Rybka 3 in blitz (since blitz favors computers over humans), this would suggest a value of 80. We may make the engine calculate contempt based on the rating you input.
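
The rule of thumb above (contempt roughly equal to rating difference divided by 15) can be sketched as a one-line function. The function name, the default divisor parameter, and the 3400 figure standing in for "about 1200 above a 2200 player" are assumptions for illustration; this is not Rybka's actual implementation.

```python
def suggested_contempt(own_rating: float, opponent_rating: float,
                       divisor: float = 15.0) -> float:
    """Heuristic contempt value: rating difference divided by a constant.

    Positive when the opponent is weaker, negative when it is stronger.
    """
    return (own_rating - opponent_rating) / divisor

# A 2200-rated blitz opponent, assumed to be about 1200 points weaker.
print(suggested_contempt(3400, 2200))

# An opponent 200 points stronger gives a negative value.
print(round(suggested_contempt(3000, 3200)))
```

Because the formula is symmetric in the rating difference, a stronger opponent simply yields a negative setting.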
Parent - - By vermillion (**) [ca] Date 2008-07-15 16:38

>the best value can be approximated by rating difference/15


Hi Larry,
Would your contempt formula work when playing against stronger opponents?
I.e., when playing an opponent that is 200 Elo higher, would a contempt setting of -13 give better results?
-v
Parent - - By lkaufman (*****) Date 2008-07-15 16:54
Logically it should, but of course there is no way to test this since no such opponent exists.
Parent - - By vermillion (**) [ca] Date 2008-07-15 17:26
Maybe not 200 elo but I was thinking of R3 dual vs R3 8 core.
Parent - By lkaufman (*****) Date 2008-07-15 17:50
I could test this, but there are some issues with testing the same program against itself on better hardware, so I'm not sure that it would give us a reliable answer. I already established that contempt=10 is quite helpful to Rybka 3 vs. Rybka 2.3.2a on my octal, so I would expect that a negative value would help if we could test against Rybka 4 for example. Everything is symmetrical about contempt, so I don't see why this shouldn't be true.
Parent - - By Vempele (Silver) [fi] Date 2008-07-15 07:38

> This 97.2% score translated to +613 Elo, which when added to the Crafty ratings gives us 3242 CCRL and 3218 CEGT, both remarkably close to my estimates for Rybka 3 quad.


http://www.open-aurec.com/wbforum/viewtopic.php?t=6587
You need more games.
Parent - - By lkaufman (*****) Date 2008-07-15 13:23
Of course I need more games to give an accurate rating, but overall I've run several hundred games against opponents who are rated more than 400 below Rybka 3's expected rating, and the conclusion is pretty consistent that with contempt properly set the indicated rating for Rybka 3 is about equal to or higher than what she gets from more reasonable pairings. I don't think that this conclusion requires more games.
Parent - - By FWCC (***) [us] Date 2008-07-15 13:27
Will setting the contempt in version 2.3.2a against lesser opponents yield similar results, Larry?
Parent - By lkaufman (*****) Date 2008-07-15 13:36
Setting contempt in 2.3.2a has far fewer consequences than in Rybka 3. Presumably setting a value would help slightly even in 2.3.2a, but maybe only very slightly.
Parent - - By 8lrr8 (***) Date 2008-07-15 07:04
"...(assuming only short opening books like CCRL and CEGT use)."

agreed. but unfortunately it's quite the arms race in the opening theory realm. it won't be long until we start seeing pretty much all major lines being strongly booked 30 moves (60 ply) deep. that will really make the elo rating of the top program plummet.
Parent - - By lkaufman (*****) Date 2008-07-15 13:16
Not if they exit theory in the first five moves with rare moves.
Parent - - By 8lrr8 (***) Date 2008-07-15 14:45
but isn't there usually a reason those moves are rare? i'd imagine more often than not they aren't particularly good and don't take you anywhere promising, esp. if it's done within the first 5 moves.
Parent - By lkaufman (*****) Date 2008-07-15 15:31
Yes, that is true, but my testing shows that a much stronger program can even give pawn odds or more to a somewhat weaker one, so it's not such a big deal to concede a slight plus by making rare moves. Basically, playing rare sidelines will still give White at least equality, so the cost is about equal to White's advantage. Presumably the cost for Black to play sidelines is just slightly more. So we can estimate that if the strength difference between two top programs is about a class (200 Elo), the top engine will lose forty Elo playing sidelines vs. having openings chosen randomly from short books. But I think that if the engines both stay in book for a long time, and use the same or equal books, the loss for the stronger program is more than that as much of the game is already over. So I think that once all the engines come with super strong and deep books, playing sidelines will give the top engines higher ratings than playing mainlines.
Parent - - By SillyFunction (**) [th] Date 2008-07-15 02:08
Your imagination is so lucid :-)
Parent - By M ANSARI (****) [kw] Date 2008-07-15 06:41
Very interesting ... I always thought that contempt was one item that did not have a lot of thought put into it and could actually be used to a big advantage if used correctly.  This seems true.  It makes no sense for a 3200 ELO program to be happy and draw a 2400 ELO program by offering a draw simply because it is -.05 down in evaluation.