Rybka Chess Community Forum
Up Topic Rybka Support & Discussion / Rybka Discussion / Another possible way to test engine strength?
- - By Weirwindle (***) [us] Date 2012-03-01 05:12 Edited 2012-03-01 05:15
I have found that an engine's strength relates strongly to how "steady" the evaluation is as a game progresses.
As a simple example I will output a string of moves, with two possible options per move and the eval of each option.
I will be using only 2 pawns, on the d- and e-files, for simplicity.
The format will be like this:

                        Move #      Better Option       Second Option ---------- Current notation

White analysis     1.             e4 0.11               d4 0.05 ----------------- (1. e4 d6 2.)
Black analysis      1.             e4 -0.03              e3 -0.11

A very theoretical 6-ply example (this may not even be legal chess):

1.      e4 0.33     d4 0.25 --------- (1. e4)
1.      e4 0.21     d4 0.15

1...    d6 0.35     d5 0.90
1...    d6 0.29     d5 0.80 --------- (1. e4 d6)

2.      d3 0.35     e5 -0.15 -------- (1. e4 d6 2. d3)
2.      e5 0.29     d3 0.21    // Black would have blundered here by playing e5 if it had been White

2...    d5 0.37     e6 0.51
2...    e6 0.25     d5 0.33 --------- (1. e4 d6 2. d3 e6) // Black misevaluates e6 here

3.      d4 0.55     e5 0.00 --------- (1. e4 d6 2. d3 e6 3. d4)
3.      d4 0.45     e5 0.00    // Black sees the earlier mistake

3...    d5 0.60     e5 0.60 --------- (1. e4 d6 2. d3 e6 3. d4 d5) // and somehow the game turns into a white win :)
3...    d5 0.51     e5 0.51

White overall stability/precision is (0.35-0.33)+(0.35-0.35)+(0.37-0.35)+(0.55-0.51)+(0.60-0.55) = 0.13 // The lower the number the better
Black overall stability/precision is (0.29-0.21)+(0.29-0.29)+(0.29-0.25)+(0.45-0.25)+(0.51-0.45) = 0.38

Then maybe adjust it so that small drifts in the score count less against stability and big swings count more.
This can be done by squaring all the differences.
White engine precision = 0.0004 + 0 + 0.0004 + 0.0016 + 0.0025 = 0.0049
Black engine precision = 0.0064 + 0 + 0.0016 + 0.04 + 0.0036 = 0.0516

Then just add this up over a set of games for better results.
I know this would not work in all situations. I am just putting it out there as a thought.
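For anyone who wants to check the arithmetic, here is a minimal Python sketch of the two sums above. The eval pairs are copied from the hand calculation in this post, not from real engine output, and the names `instability` and `total_instability` are made up for illustration.

```python
# Each pair is (eval, previous eval), copied from the hand calculation above.
WHITE_PAIRS = [(0.35, 0.33), (0.35, 0.35), (0.37, 0.35), (0.55, 0.51), (0.60, 0.55)]
BLACK_PAIRS = [(0.29, 0.21), (0.29, 0.29), (0.29, 0.25), (0.45, 0.25), (0.51, 0.45)]


def instability(pairs):
    """Return (sum of absolute differences, sum of squared differences)."""
    diffs = [abs(a - b) for a, b in pairs]
    return sum(diffs), sum(d * d for d in diffs)


def total_instability(games):
    """Add the per-game scores up over a set of games (hypothetical helper)."""
    scores = [instability(pairs) for pairs in games]
    return sum(s[0] for s in scores), sum(s[1] for s in scores)


if __name__ == "__main__":
    for name, pairs in (("White", WHITE_PAIRS), ("Black", BLACK_PAIRS)):
        linear, squared = instability(pairs)
        # Rounded only for display; lower means steadier.
        print(name, round(linear, 4), round(squared, 4))
```

Squaring leaves the ordering of the two sides unchanged here, but Black's single 0.20 swing now accounts for most of its total.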
Parent - - By Razor (****) [gb] Date 2012-03-01 05:46
If I understand you correctly you are referring to how 'stable' the evaluation score is using the same engine as you progress through a game.  Correct?

Assuming the above is correct, I do not see how this measures engine strength; at best it measures evaluation score stability, but this is not necessarily very useful, as what you really need is the right answer rather than a stable answer.  Beyond that, yes, a right answer that is also stable would be even better!  :smile:
Parent - - By Weirwindle (***) [us] Date 2012-03-01 05:50
As I saw it, the wrong answer will undoubtedly lead to some instability.
And of course there could always be an engine which only outputs 0.00.
Parent - By Razor (****) [gb] Date 2012-03-01 06:00
Nope, sadly not, I'm afraid.  You see, chess engines are based on search + some form of knowledge, and what results is a score that we use to assess how well the engine is doing.  It is not based on some exact formula such as those we use to calculate areas, volumes, etc.  For this reason, a chess engine based on the current approach will not solve chess; it will only ever be a 'best fit', and the symptoms you describe have more to do with search discoveries than degrees of 'correctness' in the answer.  It's a bit like when we go shopping with the wife, trying to find that elusive present.  Every now and then we get to a point where we can sit down and have a coffee.  At this moment you feel a little less frustrated, a little more stable.  Does it help you find the right present?  Well, no.  Does it make you feel much better - YEP! :smile:
Parent - By Uly (Gold) [mx] Date 2012-03-01 08:08
I agree with Razor that this test wouldn't be for strength. For instance, Stockfish is a very unstable engine and I'd expect it to score poorly as in some positions it may have 0.40 swings, all over the place, without a stabilization in sight, yet it'd be a stronger engine than others that are more stable.
- By rocket (***) [se] Date 2012-03-01 08:28
Here's why this test will not reflect anything: there are different types of engines. Some are "slow" and can have rapid horizon effects even during calculations before they make a move.

Others can be very fast tactically and have more stable evals that only shift once in a game, which is when they lose, but still be 100 Elo below the other engine.