Hi,

Harry played 200 games with HS_for_Rybka.ctg against Shebar.ctg at 3'+2".

Dual Xeon 5160 (4x3000MHz), time control: 3'+2", hash 128MB, tablebases 3+4+5 and ponder=off.

First 100 games:

` Score `

---------------------------------------------------

1: Rybka 2.3.2 mp 51.0 / 100 HS_for_Rybka.ctg

2: Rybka 2.3.2a mp 49.0 / 100 Shebar.ctg

---------------------------------------------------

100 games: +20 =68 -12

And 100 games with the opening books swapped:

` Score `

---------------------------------------------------

1: Rybka 2.3.2a mp 54.5 / 100 HS_for_Rybka.ctg

2: Rybka 2.3.2 mp 45.5 / 100 Shebar.ctg

---------------------------------------------------

100 games: +18 =71 -11

PGN and comments on Le Fou numérique

Regards,

Patrick

This is interesting, as I am experimenting with those two books too.

The question I am asking myself is, would there be a better way to compare the strengths of two books than playing out whole games? Especially at short time controls there is a certain random factor. An engine could come out of the book with a slight advantage and still lose due to a mistake later in the game. Or would these instances cancel each other out in the long run?

I think they should cancel each other out in the long run. Don't worry too much about time control. I have a little theory that the influence of books as a function of time control will show up as some kind of bell-shaped graph, that is:

Low time control: little influence.

Medium time control: bigger influence.

Long (or very long) time control: little influence.

But this is not really important.

I do think game results are more indicative than "eval out of book". You can have a good line in a book, even if the eval is not so promising right out of book.

In general, there are a zillion variables that influence the influence of an opening book, so nothing "absolute" can be said about the strength of a book. What we can say is that "under these concrete conditions" book A performs this well against book B. Blitz conditions are clearly an interesting environment to test in, since we, as humans, are hardwired to care about 5-10 minute events.

Personally I swear by "eval out of book". It is one thing to be considered among several. I've done a lot of private testing on books in the past, using an elaborate spreadsheet to record various metrics. While I agree book-exit evaluation is unreliable in any individual game, or even a small group of games, if you look at very large numbers of games you can see some very well-defined patterns. It's really clear that if you stay in book longer than your opponent you are generally better off, if for nothing else than the clock advantage (every move deeper than your opponent is like clocking down his CPU by ~3%). It's even clearer that there is a near-linear advantage to be gained from a positive book-exit evaluation, i.e. exiting book with +0.25 has a drastically better success rate than exiting with +0.05, let alone a negative eval. Plotting this on a graph is an eye-opener, and leads inexorably to other, even more advanced conclusions about the bookmaker's art.

Of course, all the above gets chucked in Freestyle, where it is almost totally irrelevant!
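The kind of book-exit analysis described above could be sketched roughly as follows. This is only an illustration, not the spreadsheet the poster used: the function name `score_by_exit_eval`, the bucket width, and the `games` records are all made-up placeholders, not results from this thread.

```python
import math
from collections import defaultdict

# Placeholder records, invented purely for illustration:
# (book-exit eval in pawns, result for that side: 1, 0.5 or 0).
games = [
    (0.25, 1.0), (0.22, 0.5), (0.05, 0.5), (0.04, 0.0),
    (-0.08, 0.5), (-0.12, 0.0), (0.27, 0.5), (0.06, 1.0),
]

def score_by_exit_eval(records, bucket_width=0.10):
    """Group games by book-exit eval and return the average score per bucket."""
    buckets = defaultdict(list)
    for exit_eval, result in records:
        # floor() keeps negative evals in their own buckets below zero.
        index = math.floor(exit_eval / bucket_width)
        buckets[index].append(result)
    # Key each bucket by its lower bound, e.g. 0.2 covers evals in [0.2, 0.3).
    return {
        round(index * bucket_width, 2): sum(results) / len(results)
        for index, results in sorted(buckets.items())
    }

table = score_by_exit_eval(games)
for lower_bound, average in table.items():
    print(f"exit eval from {lower_bound:+.2f}: average score {average:.2f}")
```

With a large enough set of real games, plotting the bucket averages against the bucket bounds would show whether the near-linear relationship holds for a given book.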

Thanks, very interesting. We can agree that eval out of book can be misleading in an individual opening line, but nevertheless, the strong correlation between eval and eventual result opens up a valid large-scale strategy to improve an opening book: Aim to increase the average eval out of book.

My point now is that if this can be done incrementally and in many lines at once, you WILL improve the book, even if you will still fall into some "eval-traps". This is in part why I think "persistent hash" will be such a revolution with regard to opening research etc.

Improving a book by making it deeper "just" in order to save time seems like a relatively expensive approach. I think the main advantage of higher depth is stronger moves earlier on in the opening (assuming that the extra research is correctly tracked back to earlier positions).

You're a very good student, but then again, what else could I expect. Too bad we're not on the same team; obviously there's a lot more stuff I am holding back just as you are!

Generally, I expect the importance of book to increase as the time control increases (or hardware improves). Of course, as the time control increases, draws increase, so it's possible that the Elo curve for a good book would be bell-shaped, although I would bet against it.

Eval out of book seems like a reasonable metric, but if it's your only metric, you will systematically bias your book in favor of good evals versus winning games.

Regarding this second point - I have never seriously worked with opening books, but a very similar thing happened to Rybka last year. I came up with a search efficiency metric which was good but not perfect, and relied on it exclusively going from Rybka 1.1 to Rybka 1.2 and Rybka 2.0. You can look at any rating list to see the result. In particular, this metric underestimated the importance of tactics, and you can run those three Rybka versions on tactical tests to see that result. It also cost us, among other things, the game with Shredder in Turin.

In general, there is only one true metric in computer chess - winning. With everything else, you need to be very careful. Even small biases get amplified if you over-rely on a single metric.

Vas

I totally agree with what you're saying without a single quibble. The trick in any book is to properly balance empirical results with other metrics, such as evaluations. Note that the importance of empirical results scales down as N goes lower, and the value of evaluations scales up the deeper the actual or imputed ply-depth of the evaluation. In my comments above I wasn't referring to how a book should be structured to play, but rather how test games should be evaluated from a book-effectiveness standpoint.

Good point about the relative importance. I'm pretty sure you should be able to quantify this statistically, at least if you can assign a reasonable standard deviation to each value, but then again, others might be doing this already :-)

/* Steinar */

By the way, when Vasik, Dagh and myself are chatting about something like this I feel like we're at a summit conference.

Thank you for the information. I will do more tests at short time controls. My first impression of the HS_for_Rybka.ctg book is very positive, especially compared to other engines' books like the Deep Fritz 10 book.

Hi,

I think a loss later in the game does not depend on the time control. A mistake (or bug?) can occur at a short time control or at a long one.

Perhaps a very short time control is a handicap because there are also exchanges between engines and GUI.

In this case there are 100 games with each configuration (200 games in all), so I think that the percentage of error decreases.

Patrick

Hi,

First 100 games: +17 = 68 -15

Second 100 games: +19 =71 -10

Harry

Let's do some statistics...

> First 100 games: +17 = 68 -15

or 49 - 51 for Rybka 2.3.2a_Shebar

but S.D. (standard deviation) is 2.8 and within 90% probability, the deviation is 1.65 x 2.8 = 4.6

So the 49 actually means 49 +- 4.5, that is a **span (44.5 to 53.5)**

> Second 100 games: +19 =71 -10

or 54.5 - 45.5 for Rybka 2.3.2a_HS_for_Rybka

but S.D. (standard deviation) is 2.65 and within 90% probability, the deviation is 1.65 x 2.65 = 4.4

So the 54.5 actually means 54.5 +- 4.5, that is a **span (50 to 59)**

You can see that there is quite an overlap between the 2 spans. :-)

**There is an "indication" that Rybka 2.3.2a_HS_for_Rybka is better than Rybka 2.3.2a_Shebar, but far from a proof!**
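For anyone who wants to check these spans, here is a minimal sketch. It assumes each game is an independent draw from {1, 0.5, 0}, uses the observed frequencies as probabilities, and applies a normal approximation with the factor 1.65 used above. The function name `score_span` is just an illustrative choice.

```python
import math

def score_span(wins, draws, losses, z=1.65):
    """Return (score, half_width) in game points for one side of a match."""
    n = wins + draws + losses
    mean = (wins + 0.5 * draws) / n              # average points per game
    second_moment = (wins + 0.25 * draws) / n    # E[X^2] for X in {1, 0.5, 0}
    var_per_game = second_moment - mean ** 2
    sd_total = math.sqrt(n * var_per_game)       # S.D. of the total score
    return n * mean, z * sd_total

# First match, seen from the Shebar side: 15 wins, 68 draws, 17 losses.
shebar_score, shebar_half = score_span(15, 68, 17)
# Second match, seen from the HS_for_Rybka side: 19 wins, 71 draws, 10 losses.
hs_score, hs_half = score_span(19, 71, 10)

print(f"Shebar:       {shebar_score:.1f} +- {shebar_half:.1f}")
print(f"HS_for_Rybka: {hs_score:.1f} +- {hs_half:.1f}")

# The spans overlap if the top of the lower span reaches the bottom of the higher one.
overlap = shebar_score + shebar_half > hs_score - hs_half
print("spans overlap:", overlap)
```

The half-widths come out close to the rounded +- 4.5 quoted above, and the overlap check agrees with the conclusion that the result is an indication rather than a proof.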

95 MB against 512 MB, that is clear proof!

You mean that 95 MB is more compact, so it's better?

No, I mean that with 95 MB, HS_for_Rybka.ctg does better.

Besides, your demonstration can just as well be made the other way around, and in that case the gap between the two books is even larger.

Hi,

I don't yet have any serious knowledge of statistics (I expect to start studying it seriously in 2008), but with my current limited knowledge:

How exactly did you calculate the aforementioned ranges? Can you describe the method?

I ask because I get slightly different results. Fortunately the differences are tiny, and after rounding to integers our results are identical.

But even a tiny difference is a difference, so I want to know the reason for it.

**Calculation of the range of the score of a player in a match, for example HS-Shebar, that ended +a -b =c:**

The **sample** mean value m of these results, if we are interested in HS's performance, is m = (a + c/2) / (a+b+c)

(If we were interested in Shebar's performance then m = (b + c/2) / (a+b+c) )

The **real** mean value M, if the sample size is big enough (a+b+c>30), lies in the range m - k·(s/SQRT(n)) <= M <= m + k·(s/SQRT(n))

Where:

- n = a+b+c the total number of games played.

- SQRT(x) is the square root of x.

- k is a factor that depends on how big you want the confidence level to be. For example for 95% you take k=1.96, for 90% => k=1.645, for 99% => k= 2.58 etc....

- s is the sample standard deviation and in our case it is calculated as s = SQRT( (a·(1-m)^2 + b·m^2 + c·(0.5-m)^2) / (n-1) )

The wins a are paired with (1-m)^2 and the losses b with m^2 because we are interested in HS's performance.

So the real mean value is inside the range: [ m - k·(s/SQRT(n)) , m + k·(s/SQRT(n)) ]

And since we played n games the expected score of HS is: [ n·(m - k·(s/SQRT(n))) , n·(m + k·(s/SQRT(n))) ]

If we apply all these to the scores we have:

**First 100 games: +17 =68 -15, being interested in Shebar's performance:**

m=0.49

So the mean score = 0.49·100 = 49 (there were 100 games)

s/SQRT(n) = 0.0284 ~= 0.028, exactly the same as yours after rounding and multiplying by 100.

So for the 90% confidence level that you took, we have k=1.65 (actually it is 1.645, but since you used 1.65 I use the same for a better comparison of the results), so the deviation is 1.65·0.0284 = 0.04686, and because we have 100 games it is 4.686 ~= 4.5, just like yours.

**Second 100 games: +19 =71 -10, being interested in HS's performance:**

m=0.545

So the mean score = 0.545·100 = 54.5 (there were 100 games)

s/SQRT(n) = 0.0266809 ~= 0.0267. If we multiply by 100 we get 2.67, but you have 2.65. !?!? I wonder. !?!?

So for the 90% confidence level that you took, we have k=1.65, so the deviation is 1.65·0.0267 = 0.044055, and because we have 100 games it is 4.4055 ~= 4.4, the same as yours.

So can you please describe the method you used to calculate these?

Perhaps in the sample standard deviation you used (n) as the denominator and not the (n-1) that should be used in these cases.....
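The method above, and the suspected cause of the 2.67 versus 2.65 difference, can be checked with a short sketch. The function names and the `ddof` parameter (1 for the n-1 denominator, 0 for n) are illustrative choices, not something taken from the thread.

```python
import math

def sample_sd(a, b, c, ddof=1):
    """Mean and S.D. of per-game scores for a match that ended +a -b =c."""
    n = a + b + c
    m = (a + c / 2) / n
    squares = a * (1 - m) ** 2 + b * m ** 2 + c * (0.5 - m) ** 2
    return m, math.sqrt(squares / (n - ddof))

def score_range(a, b, c, k=1.645, ddof=1):
    """Range for the expected score over n games at the confidence given by k."""
    n = a + b + c
    m, s = sample_sd(a, b, c, ddof)
    half = k * s / math.sqrt(n)
    return n * (m - half), n * (m + half)

n = 100
# Second match, HS's point of view: +19 -10 =71.
_, s_n1 = sample_sd(19, 10, 71, ddof=1)   # denominator n-1
_, s_n0 = sample_sd(19, 10, 71, ddof=0)   # denominator n
se_n1 = s_n1 / math.sqrt(n)
se_n0 = s_n0 / math.sqrt(n)
print(f"s/SQRT(n) with n-1: {se_n1:.7f}")
print(f"s/SQRT(n) with n:   {se_n0:.7f}")

low, high = score_range(19, 10, 71)
print(f"90% range for HS: {low:.1f} to {high:.1f}")
```

Multiplied by 100, the n-1 version rounds to 2.67 and the n version rounds to 2.65, which is consistent with the guess that the denominator explains the difference.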

> and because we have 100 games it is 4.686 ~= 4.5 just like yours.

>

> and because we have 100 games it is 4.4055 ~= 4.4 the same as yours.

>

> So can you please describe with what method you calculated these?

> Perhaps in the sample standard deviation you used (n) as the denominator and not the (n-1) that should be used in these cases.....

Hi George,

Seems to me we quite agree on the numbers... :-)

Like you, I think, I took the formula for the standard deviation of a trinomial distribution (Wins, Losses, Draws) from Ernst A. Heinz (in ICGA Journal, June 2003), with W+L+D = n games played.

I just adapted it, and got

**s² = (W+L)/4 - (W-L)²/(4n)**

OK, if you are a purist, throw in the factor n/(n-1) (but for n = 100 ..., s is modified by 0.5%!!!)

Remember that when you talk about standard deviation, you don't need such accuracy: it will only give you a probability (90%, 91%..., big deal!)
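For readers wondering where this closed form comes from, here is a short derivation sketch. It assumes each game scores 1, 1/2 or 0, that s² is the variance of the total score over n games with the observed frequencies used as probabilities, and that no n/(n-1) correction is applied. The symbols X (per-game score) and D (number of draws) are introduced here only for the derivation.

```latex
\begin{align*}
s^2 &= n\left(E[X^2] - E[X]^2\right)
     = \left(W + \tfrac{D}{4}\right) - \frac{\left(W + \tfrac{D}{2}\right)^2}{n},
     \qquad D = n - W - L \\
    &= \frac{n + 3W - L}{4} - \frac{(n + W - L)^2}{4n} \\
    &= \frac{n + 3W - L}{4} - \frac{n}{4} - \frac{W - L}{2} - \frac{(W - L)^2}{4n} \\
    &= \frac{W + L}{4} - \frac{(W - L)^2}{4n}
\end{align*}
```

With W = 19, L = 10 and n = 100 this gives s² = 7.25 - 0.2025 = 7.0475, so s is about 2.65, the value quoted for the second match.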

Hi Patrick,

May I ask you something?

In these matches, did you use book learning, or was it plain engine vs. engine? (I ask because I saw a lot of games repeating the same lines.)

I believe Shebar still has a lot of other strong positions, but some of its lines have already been killed on the Playchess server. Shebar is now over a year old (including the time I used it on the server before its release). People know well that an up-to-date book must be stronger than an old one. I agree Harry's book should be stronger than Shebar in some lines (e.g. B90, B89, etc.). However, if it wants to be called the strongest book, it needs to kill totally.

P.S. If the repeated winning games in these matches are not counted, what does the score become?

I may well be wrong, though. I have only 18 months of experience in the computer chess world.

BTW, thanks for the test games.

Hi Aung,

A genuine book is tailored for Rybka for tournaments against all other engines under all possible conditions.

Shebar is a good book for blitz on the Playchess server, and with an update it can become better still.

You write: "a strongest book need to kill totally" (??). Sorry Aung, but that is naive and reflects only the mentality of the "Rybka against Rybka" players on Playchess (a competition of killer moves!).

The HS_for_Rybka.ctg is adapted for all time controls from blitz to long, and also for the server (hopefully good...)

Best regards,

Harry
