I am looking for the "easiest" way to test whether a certain engine is "good". I know it is not simple at all: many games are needed for statistical significance, and there are philosophical questions about how to test engines.

I am really only interested in a quick way to see if, for example, the latest asmFish is stronger than, or roughly as strong as, the latest Stockfish development build. How would you go about it? I have Aquarium and Arena - any preferences? Is it OK to use Venator's Noomen 2 Ply Book, or do I need a "better" opening book? How many games make sense, and at what time control?

There is no quick way to know the answer.

It takes a thousand games to know. And then the answer is only truly dependable for the conditions of the tests (e.g. testing at 40/4 can give you some idea how it will perform at 40/40 but it is not a sure thing).

If you look at the Stockfish testing page, you will see that the Elo boost for various fixes is almost always different at different time controls.

I guess the best answer to your question is to wait for Pohl or CCRL or CEGT to finish their tests on the engine.

If you run a test set, the answer you get will be the answer to the question:

"How well will this engine do on this set of test problems for the time control, thread count, core speed, and memory allocated for the test?"

Now, if an engine does really well in a test, it will give you an indication of strength. But not nearly as good as playing a huge collection of games at the desired conditions.

If you try a quick test (say, 100 games) the answer will be unreliable. You cannot escape the math.
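To see why 100 games cannot settle the question, here is a back-of-the-envelope sketch (my own, not from any engine-testing tool) that converts a W/D/L match result into an Elo estimate with a normal-approximation confidence interval:

```python
import math

def elo_interval(wins, draws, losses, z=1.96):
    """Elo difference implied by a W/D/L result, with a rough
    95% confidence interval (normal approximation)."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n                  # mean score per game
    var = (wins * (1.0 - score) ** 2                  # per-game variance
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    se = math.sqrt(var / n)                           # std. error of the mean

    def to_elo(s):
        s = min(max(s, 1e-6), 1 - 1e-6)               # keep log10 finite
        return -400 * math.log10(1 / s - 1)

    return to_elo(score), to_elo(score - z * se), to_elo(score + z * se)

# A dead-even 100-game match at a 70% draw rate:
elo, low, high = elo_interval(15, 70, 15)
# elo is 0, but the interval spans roughly -37 to +37 Elo -- far too
# wide to separate two engines of similar strength.
```

Even a perfectly balanced 100-game result leaves an error bar of tens of Elo points; that is the math you cannot escape.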

I have found that sometimes more than 1000 games are needed to sort out ratings.

If the engines are very close in strength, it can be incredibly difficult to find out which one is stronger. And by the time you know the answer, both engines have changed.
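As a rough illustration of how hard close engines are to separate, here is a power-calculation sketch (my own, using a normal approximation and an assumed 70% draw rate) of how many games a given true Elo edge needs before it clears the noise at the ~95% level:

```python
import math

def games_needed(elo_diff, draw_rate=0.70, z=1.96):
    """Very rough number of games before a true edge of `elo_diff`
    Elo clears the noise at the ~95% level (normal approximation)."""
    expected = 1 / (1 + 10 ** (-elo_diff / 400))      # expected score per game
    edge = expected - 0.5                             # margin over an even score
    sigma = math.sqrt((1 - draw_rate) * 0.25)         # per-game std-dev near 50%
    return math.ceil((z * sigma / edge) ** 2)

for d in (5, 10, 20):
    print(d, games_needed(d))
```

At a 70% draw rate this suggests roughly 1,400 games to resolve a 10-Elo edge and over 5,000 for a 5-Elo edge, consistent with the "thousands of games" figures quoted in this thread.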

Agreed. I have found that closely related engines can vary by one or two points as well; it makes the rating system look unrefined.

The unreliability decreases more slowly than the amount of testing time grows: halving the error bar takes four times as many games.

It depends on how much reliability you really need. In my experience the testing conditions, time control, etc. play a minor role; the results are usually quite similar.

However, when you need to nail it down to a few Elo points, as when testing a programming change to an engine with, say, less than 10 expected Elo points of difference, then you need thousands of games or a corresponding number of test positions.

With 1000 games I usually get ~4.5 points of average Elo difference from the real Elo; with 4000 games, 2.2 points, etc. (the number of games needed grows quadratically with the precision). The 95% significance level is ~11 points for 1000 games and 3.3 for 10000 (at a 70% draw rate).

I don't think this has been systematically examined for test-position sets yet, but my guess is that a good test set takes only about half the time to reach the same significance level.

-------------------

The average Elo difference between two consecutive tests at nextchessmove is 2.36 (20000 games each). 2.36*sqrt(20000/1000) = 10.6, which is more than 4.5, but there were changes between the tested versions.
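The 1/sqrt(games) rule of thumb above is easy to check numerically; this small sketch reproduces the quoted figures:

```python
import math

# Rule of thumb from the post: average Elo error scales like 1/sqrt(games),
# so quadrupling the games roughly halves the error.
def scaled_error(err_ref, n_ref, n):
    return err_ref * math.sqrt(n_ref / n)

print(scaled_error(4.5, 1000, 4000))    # 2.25 -- the post's "2.2 points"
print(scaled_error(2.36, 20000, 1000))  # ~10.55 -- the post's "10.6"
```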

I think that testing with test positions will show which engines are the best position solvers, as a function of the type of positions we are testing with.

BTW, the most recent SF versions are much better at solving test positions than (say) one year ago.

MUCH better.

Understood, thanks.

Say I want to run a test with 1000 games now. I assume I need a suitable opening book with 500 lines (each engine plays each opening once with White and once with Black) - where can I find one?

And in terms of settings, the below should work, although tablebase adjudication would be nice in Aquarium:

The attached file is both fair and active.

It was collected with the following criteria:

where white_wins > draws and black_wins > draws and abs(ce) between 25 and 55 and abs(round(coef * 444.0,0)) between 25 and 55 and games > 5

What this means is that draws are less than 1/3 of the total outcomes, and the score is unbalanced but not lopsided, both from an analysis standpoint and from the actual outcomes in games.
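For illustration, the same filter could be expressed in Python over hypothetical per-opening records (the field names mirror the query above, but the record layout is my own assumption, not Aquarium's):

```python
# Hypothetical per-opening record; field names mirror the query above.
def is_fair_and_active(op):
    return (op["white_wins"] > op["draws"]        # decisive for White ...
            and op["black_wins"] > op["draws"]    # ... and for Black: draws < 1/3
            and 25 <= abs(op["ce"]) <= 55         # engine eval: unbalanced, not lost
            and 25 <= abs(round(op["coef"] * 444.0)) <= 55  # scale from the query
            and op["games"] > 5)                  # minimum sample size

opening = {"white_wins": 40, "draws": 30, "black_wins": 35,
           "ce": 38, "coef": 0.09, "games": 105}
print(is_fair_and_active(opening))                # True
```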

Powered by mwForum 2.27.4 © 1999-2012 Markus Wichitill