Are there tests to measure the accuracy of the centipawn evaluation value?

This topic was discussed in a previous post here in the old forum. (Sorry if I also go back to a several-years-old post of mine in the CCC forum, where no one seemed interested in reporting winning probability instead of centipawn score: "too difficult", they said.)

A method could be to take all the "false evaluations" from a database and compare them to the "true evaluations". A "false evaluation" (FE) is an evaluation that will be discredited by the engine itself, its sign changing at some later move. A "true evaluation" (TE) is one whose sign stays the same through to the end of the game. Naturally, all the fluctuating evaluations before the definitive verdict of the "correct sign" at a certain move are "false evaluations".

So the measured accuracy is

average(abs(TE)) / average(abs(FE))

or, more correctly, using root-mean-square values

sqrt(average(TE²)) / sqrt(average(FE²))
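A minimal sketch of this procedure (the helper names are my own, and evaluations are assumed to be a per-move sequence of signed centipawn scores for one game; zero scores are treated as sign-neutral, a simplifying assumption):

```python
from math import sqrt

def split_true_false(evals):
    """Split one game's evaluation sequence into (TE, FE).

    An evaluation is 'true' if it and every later evaluation share the
    same sign; everything before the last sign change is 'false'.
    """
    last_flip = 0
    for i in range(1, len(evals)):
        if evals[i] * evals[i - 1] < 0:  # sign changed between i-1 and i
            last_flip = i
    return evals[last_flip:], evals[:last_flip]

def accuracy_ratio(games):
    """average(abs(TE)) / average(abs(FE)) over a database of games."""
    te, fe = [], []
    for evals in games:
        t, f = split_true_false(evals)
        te.extend(t)
        fe.extend(f)
    if not te or not fe:  # guard: e.g. no sign flips anywhere
        return float("nan")
    return (sum(map(abs, te)) / len(te)) / (sum(map(abs, fe)) / len(fe))

def rms_ratio(games):
    """Root-mean-square variant of the same measure."""
    te, fe = [], []
    for evals in games:
        t, f = split_true_false(evals)
        te.extend(t)
        fe.extend(f)
    if not te or not fe:
        return float("nan")
    return sqrt(sum(x * x for x in te) / len(te)) / \
           sqrt(sum(x * x for x in fe) / len(fe))
```

For example, for the sequence -30, 20, -10, 50, 80, 120 the last sign change is at the fourth score, so the TEs are 50, 80, 120 and everything earlier is an FE.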

Note that these measures (especially the average over all TEs) can give indications about how to normalize the evaluation scale between different engines, as Alkelele said in August:


*Besides the thing masomusic mentions about some engines sometimes giving too-high king-safety penalties, another thing to keep in mind is that Rybka evaluations are just generally low. For example, +0.50 in Rybka may be **roughly** equivalent to +1.00 in Fritz. This is just a question of scale. [...]*

I would like to quantify that roughly...

You lost me a bit here, but anyway it's very easy to get a centipawns -> winning % mapping. Just take all positions at each centipawn value and see how the engine actually scores from those positions.
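That mapping could be estimated along these lines (a sketch with invented names: each record pairs an engine score in centipawns with the game's eventual result, 1/½/0 from the point of view of the side the score favors):

```python
from collections import defaultdict

def win_percentage_map(records, bin_size=25):
    """Empirical centipawn -> winning-fraction mapping.

    records: iterable of (score_cp, result) pairs, result being
    1.0 (win), 0.5 (draw) or 0.0 (loss).
    Scores are grouped into bins of bin_size centipawns.
    """
    totals = defaultdict(lambda: [0.0, 0])  # bin -> [sum of results, count]
    for score_cp, result in records:
        b = (score_cp // bin_size) * bin_size  # left edge of the bin
        totals[b][0] += result
        totals[b][1] += 1
    # Average result per bin = empirical winning fraction at that score.
    return {b: s / n for b, (s, n) in sorted(totals.items())}
```

Binning is needed because exact centipawn values are sparse; the bin width trades resolution against sample size per bin.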

Note that you can get Rybka to display her estimates of the winning % by checking the "Win Percentage to Hash Usage" parameter. The winning percentages will be sent as the hash usage values.

Vas

Yes... It's easy, but ChessBase doesn't find it useful to have a normalized scale, considering that I can't see it in the ChessBase GUI, as you said in the 12 December topic (I said August before, sorry). Besides, no one was interested in sending it to the engine pane before you did it very recently with the 2.2n2 version.

However, this winning % (the mapping is so helpful and pleasant... don't you think?) is relative to the Rybka evaluation, and we don't know how trustworthy it is (as the author of that topic said). Besides, a more absolute winning % is easy to find: use your centipawn value directly with a pre-arranged table of winning percentages based on material imbalances from a large impartial database (better if from different engines), in a way similar to what Larry Kaufman did in 1999 to calculate the values of the pieces.
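A sketch of that lookup (the imbalance values and percentages below are invented placeholders for illustration, not Kaufman's actual numbers):

```python
# Hypothetical table: material imbalance in centipawns -> observed winning %,
# as it might be compiled from a large database (placeholder values only).
IMBALANCE_WIN_PCT = {
    0: 50.0,
    100: 62.0,   # up roughly a pawn
    300: 80.0,   # up roughly a minor piece
    500: 92.0,   # up roughly a rook
}

def absolute_win_pct(score_cp, table=IMBALANCE_WIN_PCT):
    """Map an engine's centipawn score onto the database-derived winning %
    by linear interpolation between the nearest tabulated imbalances."""
    points = sorted(table.items())
    if score_cp <= points[0][0]:
        return points[0][1]
    if score_cp >= points[-1][0]:
        return points[-1][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= score_cp <= x1:
            return y0 + (y1 - y0) * (score_cp - x0) / (x1 - x0)
```

With these placeholder numbers, a +2.00 score would land halfway between the pawn-up and minor-piece-up entries.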

My guess is that the method I've set out above (TE/FE, not that last method from a large database) is the most accurate way to gauge how trustworthy the winning % is.

Giulio

