Rybka Chess Community Forum
Up Topic The Rybka Lounge / Computer Chess / How are rating list reliable with such small samples?
- - By rocket (****) Date 2021-02-08 13:48
You need hundreds of games between two engines to get statistically significant results, yet CEGT and CCRL generally stop at around 50 games, and sometimes at 10-15, which is even less meaningful.
Parent - By MrKris (***) Date 2021-02-09 05:07
They are not bad, about as good as the FIDE or USCF (United States) lists. (At CEGT and CCRL the testers choose the pairings; human players play wherever they want.)

> around 50 games, sometimes 10-15


Most important is each engine's total games in the list.

The lower number of games per opponent is related to statistical sampling:
https://en.wikipedia.org/wiki/Sampling_(statistics)

Compare these three (CEGT is good, but its variety of core counts makes its lists harder to read):
CCRL 40/15   http://ccrl.chessdom.com/ccrl/4040/
SP-CC.de   https://www.sp-cc.de/
fastgm.de   http://www.fastgm.de/60min.html
Parent - - By user923005 (****) Date 2021-02-09 10:33 Upvotes 1
You are right that 50 games is not enough for any statistical significance.
But if you run 50 games against each of 20 well-measured opponents, that gives you 1000 games.
That is actually better than 1000 games against a single opponent.
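
The noise side of this can be sketched in a few lines. This is a rough illustration, not anything the rating lists themselves publish: it assumes independent games, a per-game standard deviation of 0.4 score points (a guess that depends on the draw rate), and two roughly equal engines. The function names are made up for the example. It only covers random error; it says nothing about whether a single opponent gives a skewed picture, which is the other half of the argument.

```python
import math

def score_std_error(games, per_game_sd=0.4):
    """Standard error of the average score after `games` independent games.
    per_game_sd=0.4 is an assumed value, not a measured one."""
    return per_game_sd / math.sqrt(games)

def elo_per_score_point(score):
    """Slope of the Elo curve 400*log10(s/(1-s)) at score s."""
    return 400.0 / (math.log(10) * score * (1.0 - score))

def elo_std_error(games, score=0.5, per_game_sd=0.4):
    """Approximate one-sigma Elo uncertainty after `games` games."""
    return score_std_error(games, per_game_sd) * elo_per_score_point(score)

# 50 games against one opponent versus 1000 games in total
for games in (50, 1000):
    print(games, "games: about +/-", round(elo_std_error(games), 1), "Elo (one sigma)")
```

Under these assumptions the uncertainty depends on the total number of games, so 20 opponents at 50 games each are about as precise as 1000 games against one opponent.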
Parent - By rocket (****) Date 2021-02-09 12:12 Edited 2021-02-09 12:20

> It is actually better than 1000 games against a single opponent


No, it isn't, and I'll show you just how unrepresentative small samples can be.

Rybka 3 vs Deep Fritz 10.1 at CCRL 40/40: +16 −4 =10, which is 70% for Rybka 3.

At CEGT 40/4 (both engines have roughly the same Elo difference between fast and slow time controls):

Rybka 3 vs Deep Fritz 10.1: +143 −16 =41, which is 82.3% for Rybka 3 once we get up to hundreds of games.

That is a 12-point difference in score between hundreds of games and 30.

12 points is a lot: with only 30 games, Fritz performs 98+ Elo better than it does over hundreds.
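
For anyone who wants to check the arithmetic, here is a minimal sketch that turns the two records quoted above into scores, Elo differences and approximate 95% intervals. It assumes the usual logistic Elo formula and a normal approximation to the score, which is crude for 30 games. The function names are made up for the example.

```python
import math

def score_fraction(wins, losses, draws):
    """Score as a fraction: a win counts 1, a draw 0.5, a loss 0."""
    return (wins + 0.5 * draws) / (wins + losses + draws)

def elo_diff(score):
    """Elo difference implied by a score fraction (logistic Elo model)."""
    return 400.0 * math.log10(score / (1.0 - score))

def score_interval(wins, losses, draws, z=1.96):
    """Approximate 95% interval for the score (normal approximation)."""
    n = wins + losses + draws
    s = score_fraction(wins, losses, draws)
    # Per-game variance of the result around the observed mean score
    var = (wins * (1.0 - s) ** 2 + draws * (0.5 - s) ** 2 + losses * s ** 2) / n
    half_width = z * math.sqrt(var / n)
    return s - half_width, s + half_width

records = (("CCRL 40/40", (16, 4, 10)), ("CEGT 40/4", (143, 16, 41)))
for name, (w, l, d) in records:
    s = score_fraction(w, l, d)
    low, high = score_interval(w, l, d)
    print(f"{name}: score {s:.4f}, Elo {elo_diff(s):+.0f}, "
          f"interval {elo_diff(low):+.0f} to {elo_diff(high):+.0f}")
```

Under these assumptions the 30-game interval is wide enough to contain the long-run score, so the two results may not conflict as much as the raw percentages suggest. Note also that the CEGT counts as quoted work out to 163.5/200 = 81.75%, slightly below the 82.3% stated.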
Parent - By MrKris (***) Date 2021-02-10 01:13

> It is actually better than 1000 games against a single opponent


Better, in the same way that a FIDE player's rating from various events is better than one based on the same total number of games in a match against a single opponent.
Parent - - By Vegan (****) Date 2021-02-16 02:56
I have looked into this deeply. I have a math background, but the rating system is comparatively insensitive, so I am wondering whether the span may need to be expanded.

FIDE spans maybe 1300 to 2800, so stretching it out might be a prudent move considering computer contestants.

Even with 1000 games played, ratings still moved a lot.
Parent - By rocket (****) Date 2021-02-16 09:40
You might recall that in the second TCEC Rybka 4.1 - Houdini match, Rybka went on a winning streak and was leading the match by two games. Everybody was starting to speculate that the 4.1 bug fix had improved its strength.

Later Houdini caught up and won the match. But that's how close it can be even with a 40-70 Elo difference.

In the first Deep Fritz - Deep Junior match in 2002, Junior was up 5-0 but still lost.
Parent - By rocket (****) Date 2021-02-16 09:42
If I took small samples from 50 states in America, rather than one large sample from one state, would the former be more representative?
- - By rocket (****) Date 2021-02-09 12:14
Now imagine that we have 50 of these 30-game samples, and 20% of them have an unrepresentative score for Fritz due to the small sample, where it scores 100(!) Elo higher than its actual estimated rating...
All of this would be mitigated if we simply had 100+ games against each opponent.
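
How often a 30-game sample is off by that much can be estimated roughly. The sketch below treats the 200-game CEGT record quoted earlier in the thread (+143 −16 =41) as the true strength, which is itself an assumption, and uses a normal approximation for the sample score. The function names are made up for the example.

```python
import math

def expected_score(elo_diff):
    """Expected score for a given Elo advantage (logistic Elo model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def elo_from_score(score):
    """Elo difference implied by a score fraction."""
    return 400.0 * math.log10(score / (1.0 - score))

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob_sample_off_by(wins, losses, draws, sample_games, elo_shift):
    """Chance that a sample of `sample_games` games makes the weaker engine
    look at least `elo_shift` Elo better than the long-run record says.
    Assumes the long-run record is the truth and games are independent."""
    n = wins + losses + draws
    s = (wins + 0.5 * draws) / n
    var = (wins * (1.0 - s) ** 2 + draws * (0.5 - s) ** 2 + losses * s ** 2) / n
    se = math.sqrt(var / sample_games)
    shifted = expected_score(elo_from_score(s) - elo_shift)
    return normal_cdf((shifted - s) / se)

for games in (30, 100):
    p = prob_sample_off_by(143, 16, 41, games, 100)
    print(f"{games} games: chance of a 100+ Elo miss in Fritz's favour ~ {p:.2%}")
```

With these inputs the estimate for 30 games comes out at a few percent rather than 20%, and it drops well below one percent at 100 games, so the direction of the argument holds even if the 20% figure looks high.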
Parent - By MrKris (***) Date 2021-02-10 01:30
Say 30 games vs. 100+ games per opponent:

1) The 'listers' would have to buy 2+ times more computers (for 3+ times the total), or their lists would progress at less than 1/3 of their current rate.

2) That time and money has to be weighed against the, I presume, low need for more accurate ratings.
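
The trade-off in these two points can be put into numbers with a small sketch. It assumes independent games, so that the random error shrinks with the square root of the number of games; `tradeoff` is a made-up helper name.

```python
import math

def tradeoff(games_now, games_target):
    """Return (cost factor, error factor) for playing more games per pairing.
    Assumes error is proportional to 1/sqrt(games)."""
    cost_factor = games_target / games_now
    error_factor = math.sqrt(games_now / games_target)
    return cost_factor, error_factor

cost, error = tradeoff(30, 100)
print(f"cost x{cost:.2f}, error x{error:.2f}")
```

Going from 30 to 100 games per opponent costs about 3.3 times the computing time and cuts the random error by a bit less than half, under that assumption.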