Harry played 200 games with HS_for_Rybka.ctg against Shebar.ctg at 3'+2".
Dual Xeon 5160 (4x3000 MHz), time control 3'+2", Hash 128 MB, TBs 3+4+5, ponder=off.
First 100 games:
1: Rybka 2.3.2 mp 51.0 / 100 HS_for_Rybka.ctg
2: Rybka 2.3.2a mp 49.0 / 100 Shebar.ctg
100 games: +20 =68 -12
And 100 games with the opening books swapped:
1: Rybka 2.3.2a mp 54.5 / 100 HS_for_Rybka.ctg
2: Rybka 2.3.2 mp 45.5 / 100 Shebar.ctg
100 games: +18 =71 -11
PGN files and comments are on Le Fou numérique.
The question I am asking myself is, would there be a better way to compare the strengths of two books than playing out whole games? Especially at short time controls there is a certain random factor. An engine could come out of the book with a slight advantage and still lose due to a mistake later in the game. Or would these instances cancel each other out in the long run?
Low time control: little influence.
Medium time control: bigger influence.
Long (or very long) time control: little influence.
But this is not really important.
I do think game results are more indicative than "eval out of book". You can have a good line in a book, even if the eval is not so promising right out of book.
In general, there are a zillion variables that affect how much an opening book matters, so nothing "absolute" can be said about the strength of a book. What we can say is that "under these concrete conditions" book A performs this well against book B. Blitz conditions are clearly an interesting environment to test in, since we, as humans, are hardwired to care about 5-10 minute events.
Of course, all the above gets chucked in Freestyle, where it is almost totally irrelevant!
My point now is that if this can be done incrementally and in many lines at once, you WILL improve the book, even if you will still fall into some "eval-traps". This is in part why I think "persistent hash" will be such a revolution with regard to opening research etc.
Improving a book by making it deeper "just" in order to save time seems like a relatively expensive approach. I think the main advantage of higher depth is stronger moves earlier on in the opening (assuming that the extra research is correctly tracked back to earlier positions).
Eval out of book seems like a reasonable metric, but if it's your only metric, you will systematically bias your book in favor of good evals versus winning games.
Regarding this second point - I have never seriously worked with opening books, but a very similar thing happened to Rybka last year. I came up with a search efficiency metric which was good but not perfect, and relied on it exclusively going from Rybka 1.1 to Rybka 1.2 and Rybka 2.0. You can look at any rating list to see the result. In particular, this metric underestimated the importance of tactics, and you can run those three Rybka versions on tactical tests to see that result. It also cost us among other things the game with Shredder in Turino.
In general, there is only one true metric in computer chess - winning. With everything else, you need to be very careful. Even small biases get amplified if you over-rely on a single metric.
/* Steinar */
I think a loss later in the game does not depend on the time control. A mistake (or bug?) can occur at a short time control or at a long one.
Perhaps a very short time control is a handicap, because there are also exchanges between the engines and the GUI.
In this case there are 100 games with each configuration (200 games in all), so I think the error percentage decreases.
First 100 games: +17 =68 -15
Second 100 games: +19 =71 -10
> First 100 games: +17 =68 -15
or 49 - 51 for Rybka 2.3.2a_Shebar
but the S.D. (standard deviation) is 2.8, and for 90% probability the deviation is 1.65 x 2.8 = 4.6.
So the 49 actually means 49 +- 4.5, that is, a span of (44.5 to 53.5).
> Second 100 games: +19 =71 -10
or 54.5 - 45.5 for Rybka 2.3.2a_HS_for_Rybka
but the S.D. (standard deviation) is 2.65, and for 90% probability the deviation is 1.65 x 2.65 = 4.4.
So the 54.5 actually means 54.5 +- 4.4, that is, a span of (50.1 to 58.9).
You can see that there is quite an overlap between the two spans. :-)
There is an "indication" that Rybka 2.3.2a_HS_for_Rybka is better than Rybka 2.3.2a_Shebar, but it is far from a proof!
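For anyone who wants to reproduce these spans, here is a minimal Python sketch (the function name is mine). It uses the trinomial standard deviation of the total match score, s² = (W+L)/4 - (W-L)²/(4n), with k = 1.65 for the 90% level as above:

```python
from math import sqrt

def score_span(wins, draws, losses, k=1.65):
    """90% span for a player's total match score, using the trinomial
    standard deviation s^2 = (W+L)/4 - (W-L)^2/(4n)."""
    n = wins + draws + losses
    score = wins + draws / 2.0
    s = sqrt((wins + losses) / 4.0 - (wins - losses) ** 2 / (4.0 * n))
    return score - k * s, score + k * s

# First 100 games, seen from Shebar's side: +15 =68 -17
lo1, hi1 = score_span(15, 68, 17)
# Second 100 games, seen from HS_for_Rybka's side: +19 =71 -10
lo2, hi2 = score_span(19, 71, 10)
print(f"Shebar:       {lo1:.1f} to {hi1:.1f}")   # about 44.3 to 53.7
print(f"HS_for_Rybka: {lo2:.1f} to {hi2:.1f}")   # about 50.1 to 58.9
print("spans overlap:", lo2 <= hi1)              # True
```

The unrounded values come out slightly different from the hand-rounded ones above, but the conclusion (overlapping spans) is the same.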
Besides, your demonstration could just as well be made in the other direction, and in that case the gap between the two books would be even larger.
I don't yet have any serious knowledge of statistics (I expect to start studying statistics seriously in 2008), but with my limited knowledge as of now:
How exactly did you calculate the ranges above? Can you describe the method?
Because I get slightly different results; thankfully the differences are tiny, and by rounding to integers we get identical results.
But tiny still means different, so I want to find the reason for this difference...
Calculation of the range of the score of a player in a match, for example HS-Shebar, that ended +a -b =c:
The sample mean value m of these results, if we are interested in HS's performance, is m = (a + c/2) / (a+b+c).
(If we were interested in Shebar's performance, then m = (b + c/2) / (a+b+c).)
The real mean value M, provided the sample size is big enough (a+b+c > 30), satisfies m - k·(s/SQRT(n)) <= M <= m + k·(s/SQRT(n)), where:
- n = a+b+c the total number of games played.
- SQRT(x) is the square root of x.
- k is a factor that depends on how big you want the confidence level to be. For example for 95% you take k=1.96, for 90% => k=1.645, for 99% => k= 2.58 etc....
- s is the sample standard deviation and in our case it is calculated as s = SQRT( (a·(1-m)^2 + b·m^2 + c·(0.5-m)^2) / (n-1) )
This is because we are interested in HS's performance.
So the real mean value is inside the range: [ m - k·(s/SQRT(n)) , m + k·(s/SQRT(n)) ]
And since we played n games, the expected total score of HS lies in: [ n·(m - k·(s/SQRT(n))) , n·(m + k·(s/SQRT(n))) ]
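The recipe above can be sketched in Python (the names are mine, purely for illustration):

```python
from math import sqrt

def mean_interval(a, b, c, k=1.645):
    """Confidence interval for the real per-game mean score M of the
    first player after a match ending +a -b =c, following the recipe above."""
    n = a + b + c
    m = (a + c / 2.0) / n                                  # sample mean
    # sample standard deviation, denominator n-1
    s = sqrt((a * (1 - m) ** 2 + b * m ** 2 + c * (0.5 - m) ** 2) / (n - 1))
    dev = k * s / sqrt(n)
    return m - dev, m + dev

# Second 100 games from HS's side: +19 -10 =71
lo, hi = mean_interval(19, 10, 71)
print(f"expected score over 100 games: {100 * lo:.1f} to {100 * hi:.1f}")
```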
If we apply all these to the scores we have:
First 100 games: +17 =68 -15, looking at Shebar's performance:
So the mean score = 0.49·100 = 49 (the games were 100).
s/SQRT(n) = 0.0284 ~= 0.028, exactly like yours after rounding and multiplying by 100.
So for the 90% confidence level, as you took, we have k=1.65 (actually it is 1.645, but since you used 1.65 I use the same, to allow a better comparison of the results), so the deviation is 1.65·0.0284 = 0.04686, and because we have 100 games it is 4.686 ~= 4.5, just like yours.
Second 100 games: +19 =71 -10, looking at HS's performance:
So the mean score = 0.545·100 = 54.5 (the games were 100).
s/SQRT(n) = 0.0266809 ~= 0.0267. And if we multiply by 100 we get 2.67, but you have 2.65. I wonder why!
So for the 90% confidence level we have k=1.65, so the deviation is 1.65·0.0267 = 0.044055, and because we have 100 games it is 4.4055 ~= 4.4, the same as yours.
So can you please describe with what method you calculated these?
Perhaps in the sample standard deviation you used as a denominator (n) and not the (n-1) that should be used in these cases...
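The suspected n versus n-1 question is easy to probe numerically; a small sketch (not anyone's actual code) reproduces both numbers:

```python
from math import sqrt

def sd_over_sqrt_n(a, b, c, bessel=True):
    """s/SQRT(n) for the first player's per-game score, with the
    denominator n-1 (sample SD) or n (population SD)."""
    n = a + b + c
    m = (a + c / 2.0) / n
    ss = a * (1 - m) ** 2 + b * m ** 2 + c * (0.5 - m) ** 2
    s = sqrt(ss / (n - 1 if bessel else n))
    return s / sqrt(n)

# Second 100 games: +19 -10 =71, scaled by 100 as in the posts above
print(100 * sd_over_sqrt_n(19, 10, 71, bessel=True))    # ~2.67
print(100 * sd_over_sqrt_n(19, 10, 71, bessel=False))   # ~2.65
```

So the 2.65 versus 2.67 difference is exactly the n versus n-1 denominator.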
> and because we have 100 games it is 4.686 ~= 4.5 just like yours.
> and because we have 100 games it is 4.4055 ~= 4.4 the same as yours.
> So can you please describe with what method you calculated these?
> Perhaps in the sample standard deviation you used as a denominator (n) and not the (n-1) that should be used in these cases...
Seems to me we quite agree on the numbers... :-)
Like you, I think, I took the formula for the standard deviation of the trinomial distribution (Wins, Losses, Draws) from Ernst A. Heinz (ICGA Journal, June 2003), with W+L+D = n games played.
I just adapted it and got s² = (W+L)/4 - (W-L)²/(4n).
OK, if you are a purist, throw in the factor n/(n-1) (but for n = 100, s changes by only about 0.5%!).
Remember that when you talk about standard deviations, you don't need that much accuracy: it will only give you a probability (90%, 91%... big deal!)
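As a quick sanity check (mine, not from the Heinz article), the adapted closed form agrees exactly with the definition of the total-score standard deviation when the denominator is n:

```python
from math import sqrt

def sd_closed_form(w, d, l):
    """Adapted closed form: s^2 = (W+L)/4 - (W-L)^2/(4n)."""
    n = w + d + l
    return sqrt((w + l) / 4.0 - (w - l) ** 2 / (4.0 * n))

def sd_from_definition(w, d, l):
    """Total-score SD from the definition: per-game scores are 1, 1/2, 0;
    total-score variance = n * (ss / n) = ss, the sum of squared deviations."""
    n = w + d + l
    m = (w + d / 2.0) / n
    ss = w * (1 - m) ** 2 + d * (0.5 - m) ** 2 + l * m ** 2
    return sqrt(ss)

for w, d, l in [(17, 68, 15), (19, 71, 10)]:
    assert abs(sd_closed_form(w, d, l) - sd_from_definition(w, d, l)) < 1e-9
    print(f"+{w} ={d} -{l}: s = {sd_closed_form(w, d, l):.3f}")
```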
May I ask you something?
In these tournaments, did you use book learning, or plain engine vs engine? (Because I saw a lot of games with repeating lines.)
I believe Shebar still retains a lot of other strong positions, but some of its lines have already been killed on the Playchess server. By now Shebar is over one year old (including the time I used it on the server before release). People know that an up-to-date book should be stronger than an old book. I agree Harry's book should be stronger than Shebar in some lines (e.g. B90, B89, etc.). However, if it wants the name of strongest book, it needs to kill totally.
P.S. In these tournaments, if the repeating winning games are not counted, what does the score become?
Even if I am wrong... I have only 18 months of experience in the computer chess world.
BTW, thanks for the testing games.
A genuine book is tailored for Rybka, for tournaments against all other engines under all possible conditions.
Shebar is a good book for blitz on the Playchess server, and with an update it can become even better.
You write: "a strongest book need to kill totally" (??). Sorry Aung, but that is naive and reflects only the mentality of the "Rybka against Rybka" players on Playchess (a competition of killer moves!).
The HS_for_Rybka.ctg is adapted for all time controls, from blitz to long, and also for the server (hopefully well...).