Rybka Chess Community Forum
Up Topic The Rybka Lounge / Computer Chess / Quick way to test engines
- - By Hamster (**) Date 2019-01-08 17:04
I am looking for the "easiest" way to test if a certain engine is "good". I know it is not simple at all, many games are needed for statistical significance and there are philosophical questions on how to test engines.

I am really only interested in a quick way to see if, for example, the latest asmFish is stronger than or roughly as strong as the latest Stockfish development version. How would you go about it? I have Aquarium and Arena - any preferences? Is it ok to use Venator's Noomen 2 Ply Book, or do I need a "better" opening book? How many games make sense, and with what time control etc.? :grin:
Parent - - By user923005 (****) Date 2019-01-08 19:30
There is no quick way to know the answer.
It takes a thousand games to know.  And then the answer is only truly dependable for the conditions of the tests (e.g. testing at 40/4 can give you some idea how it will perform at 40/40 but it is not a sure thing).
If you look at the Stockfish testing page, you will see that the Elo boost for various fixes is almost always different at different time controls.
I guess the best answer to your question is to wait for Pohl or CCRL or CEGT to finish their tests on the engine.

If you run a test set, the answer you get will be the answer to the question:
"How well will this engine do on this set of test problems for the time control, thread count, core speed, and memory allocated for the test?"

Now, if an engine does really well in a test, it will give you an indication of strength.  But not nearly as good as playing a huge collection of games at the desired conditions.

If you try a quick test (say, 100 games) the answer will be unreliable.  You cannot escape the math.
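To put a number on "unreliable", here is a minimal sketch that turns a match result into an Elo difference with an approximate 95% interval. It assumes the usual logistic Elo model and a normal approximation for the score; the function names and the 30/50/20 result are made up for illustration and are not taken from any testing tool.

```python
import math

def elo_from_score(score):
    """Convert a score fraction (0 < score < 1) to an Elo difference (logistic model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def match_elo(wins, draws, losses, z=1.96):
    """Return (elo, low, high) for a match result; z=1.96 gives a ~95% interval."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    se = math.sqrt(var / n)  # standard error of the mean score
    return (elo_from_score(score),
            elo_from_score(score - z * se),
            elo_from_score(score + z * se))

# Hypothetical 100-game result: 30 wins, 50 draws, 20 losses.
elo, low, high = match_elo(30, 50, 20)
print(f"{elo:+.1f} Elo, 95% interval [{low:+.1f}, {high:+.1f}]")
```

In this made-up example the interval still includes zero, so even a 55% score over 100 games does not show which engine is stronger. The sketch breaks down for very lopsided results, where the interval bounds would leave the 0..1 score range.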
Parent - - By Vegan (****) Date 2019-01-09 00:31
I found that sometimes more than 1000 games are needed to sort out ratings.
Parent - - By user923005 (****) Date 2019-01-09 19:32
If the engines are very close in strength, it can be incredibly difficult to find out which one is stronger.  And by the time you know the answer, both engines have changed.
Parent - By InspectorGadget (*****) Date 2019-01-10 19:33
Parent - By Vegan (****) Date 2019-01-13 07:15
Agreed, I have found that closely related engines can vary by one or two points as well.

It makes the rating system look unrefined.
Parent - - By gsgs (***) Date 2019-01-09 10:17 Edited 2019-01-09 10:58
The amount of unreliability decreases more slowly than the amount of time needed increases.

It depends on how much reliability you really need.

And in my experience the testing conditions, time control etc. play a minor role;
the results are usually quite similar.

However, when you need to nail it down to a few Elo points, as when testing a programming change
to an engine with, say, less than 10 Elo of expected difference,
then you need thousands of games or a ~~corresponding~~ amount of test positions.

OK, so with 1000 games I usually get ~4.5 points average Elo difference from the real Elo,
for 4000 games 2.2 points, etc. (quadratic: four times the games halves the error).
The 95% significance level is ~11 points for 1000 games, 3.3 for 10000 (for a draw rate of 70%).
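The square-root scaling can be sketched as follows. This is a rough model, assuming two equally strong engines, independent games, and the logistic Elo curve linearised around a 50% score; `elo_margin` and `games_needed` are names invented for this sketch. It lands in the same ballpark as the figures above rather than reproducing them exactly.

```python
import math

# Slope of the Elo curve at a 50% score: Elo points per unit of score (~695).
ELO_PER_SCORE = 400.0 / (math.log(10) * 0.25)

def elo_margin(games, draw_rate, z=1.96):
    """Approximate +/- Elo margin (z=1.96 for ~95%) between equal engines."""
    # Each game scores 0, 0.5 or 1; with equal engines the variance is (1 - draw_rate) / 4.
    sd_per_game = math.sqrt((1.0 - draw_rate) / 4.0)
    return z * ELO_PER_SCORE * sd_per_game / math.sqrt(games)

def games_needed(target_margin, draw_rate, z=1.96):
    """Games required to shrink the margin to target_margin Elo points."""
    one_game = elo_margin(1, draw_rate, z)
    return math.ceil((one_game / target_margin) ** 2)

for n in (100, 1000, 4000, 10000):
    print(n, "games: +/-", round(elo_margin(n, 0.70), 1), "Elo")
print("games for +/-5 Elo:", games_needed(5.0, 0.70))
```

Under these assumptions, halving the margin always costs four times as many games, which is why the last few Elo points are so expensive to resolve.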

I think this hasn't yet been systematically examined for test-position sets?!
But my guess is that a good test set needs only about half the time to reach the same significance level.

The average Elo difference between 2 consecutive tests at nextchessmove is 2.36 (20000 games).
2.36*sqrt(20000/1000)=10.6, which is more than 4.5, but there were changes between the tested versions.
Parent - By user923005 (****) Date 2019-01-09 19:33
I think that testing with test positions will show which engines are the best position solvers, as a function of the type of positions we are testing with.
BTW, the most recent SF versions are much better at solving test positions than (say) one year ago.
MUCH better.
- - By Hamster (**) Date 2019-01-09 15:09
Understood, thanks.

Say I now want to run a test with 1000 games. I assume I need a suitable opening book with 500 lines
(each engine plays each opening once with White and once with Black) - where can I find one? :confused:
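The 500-lines-for-1000-games arithmetic can be sketched like this. It is only an illustration of the pairing scheme, not how Aquarium or Arena schedule games internally; the opening names are placeholders and `build_schedule` is a hypothetical helper.

```python
def build_schedule(openings, engine_a, engine_b):
    """Return a list of (opening, white, black) games, two per opening."""
    games = []
    for opening in openings:
        games.append((opening, engine_a, engine_b))  # engine_a has White
        games.append((opening, engine_b, engine_a))  # colours reversed
    return games

# Placeholder names standing in for 500 real book lines.
openings = [f"line_{i:03d}" for i in range(1, 501)]
schedule = build_schedule(openings, "asmFish", "Stockfish")
print(len(schedule))  # 1000
```

Playing every line with both colours means any bias in an opening affects both engines equally.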

And in terms of settings, the below should work although tablebase adjudication would be nice in Aquarium:

Parent - By user923005 (****) Date 2019-01-09 20:22
Attached file is both fair and active.
It was collected with the following criteria:

where white_wins > draws and black_wins > draws and abs(ce) between 25 and 55 and abs(round(coef * 444.0,0)) between 25 and 55 and games > 5

What this means is that draws are less than 1/3 of the total outcomes, and the score is neither drawish nor lopsided, both from an analysis standpoint and from the actual outcomes in games.
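For anyone who wants to apply the same filter to their own statistics, the criteria can be sketched as a Python predicate. The field names mirror the where-clause; the meaning of `coef` (assumed here to be a result-based coefficient that, times 444, gives a centipawn-like number) is a guess from context, and `OpeningStats` is a hypothetical record type, not the actual database schema.

```python
from dataclasses import dataclass

@dataclass
class OpeningStats:
    white_wins: int
    draws: int
    black_wins: int
    ce: int       # engine evaluation in centipawns
    coef: float   # assumption: coefficient derived from game results
    games: int

def is_fair_and_active(row, lo=25, hi=55):
    """Mirror of the where-clause; 'between' is inclusive at both ends."""
    outcome_cp = abs(round(row.coef * 444.0))
    return (row.white_wins > row.draws
            and row.black_wins > row.draws
            and lo <= abs(row.ce) <= hi
            and lo <= outcome_cp <= hi
            and row.games > 5)

# Made-up rows for illustration only.
lively = OpeningStats(white_wins=40, draws=20, black_wins=30, ce=35, coef=0.09, games=90)
drawish = OpeningStats(white_wins=10, draws=50, black_wins=10, ce=30, coef=0.07, games=70)
print(is_fair_and_active(lively), is_fair_and_active(drawish))
```

Since both win counts must exceed the draw count, draws are necessarily under a third of all results. Python's `round` breaks exact ties differently from some SQL dialects, so values landing exactly on .5 may differ at the boundaries.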
Attachment: fair.pgn - Fair and active book (93k)
Attachment: fair.epd - Here is the same thing as EPD records, but also decorated with full analysis (171k)