I have collected hundreds of games in the CB engine room between fast quads playing Zappa Mexico (not II) and Rybka 2.3.2a at 16-0. Your statement that you don't get a big divergence (I use > 50 cp near 0) between evaluations in most games is correct, but these games are not by any means rare. I'm guessing they happen about 25% of the time.
Of course when these cases do happen, you have to make sure it was really the eval and not one engine out-searching the other.
One very early game example that comes to mind is the Benko Gambit, which Rybka seems to understand very well while Zappa just thinks it's down a pawn.
Better evaluation is not only a matter of the clear cases where one program evaluates wrongly and the other correctly, but also of situations where the difference between the programs' evaluations is less than 0.1 pawns.
Even if one program is better in the majority of the cases where there is a big disagreement, that does not mean it has the better evaluation, because the second program may be better at evaluating situations where there is only a small disagreement.
If programs A and B search the same tree, but A evaluates move X as 0.01 pawns better than move Y while B evaluates move Y as 0.01 pawns better than move X, then the choice between X and Y can give different results.
I agree with your assessment that small differences in evaluation may be just as important, or even more so, but I can't see any way to pick these out so I didn't focus on them.
> If we talk about evaluation then my opinion is that in the majority of the game you cannot know if rybka has better evaluation or Zappa has better evaluation.
There are (at least in theory) techniques for determining (or at least bounding) the quality of an evaluation function. For instance, in backgammon, the evaluation of a position "should" be the weighted average of the evaluations of its 21 direct descendants, which are in turn the weighted averages of the evaluations of their descendants, etc. One measure of an evaluator is thus its level of self-consistency. A chess analogy could be the extent to which a score/PV remains stable when depth is increased. A chess-based paper not unrelated to this scheme is Tuning evaluation functions by maximizing concordance by Gomboc, Buro, and Marsland, though it takes a different tack: it compares evaluations with Informator guesstimates, whereas I might propose an evaluation/search-with-eval comparison.
> It's not obvious that stable score/PV with increased depth is a suitable stand-in for self-consistency.
It is in theory (that is, as depth goes to infinity). :) [In practise, I tend to agree with you - indeed, short-term self-consistency is not necessarily a great metric in backgammon, especially with back games]. Perhaps a better method would be to compare self-consistency (in a probabilistic sense) of 0ply searches to 1ply searches (allowing qsearch, I guess, as it's almost part of the eval in some thinkings) - this should be superior to comparing (say) 15ply to 16ply, where search stability/methodology should be of more importance.
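The 0-ply vs. 1-ply comparison could be sketched like this. Everything here is a toy stand-in, not a real engine: positions are plain integers, each with two hypothetical "moves", and the static eval is an arbitrary made-up function; the point is only the shape of the self-consistency measurement.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Toy stand-in for a real engine: positions are integers, each position
// has two "moves" leading to child positions. All names are hypothetical.
static std::vector<int> children(int pos) { return {2 * pos, 2 * pos + 1}; }

// A deliberately imperfect static evaluation, from the side to move's view.
static double eval(int pos) { return std::sin(pos * 0.1) * 100.0; }

// 1-ply backed-up score: best child score, negated (negamax convention).
static double eval1ply(int pos) {
    double best = -1e9;
    for (int c : children(pos)) best = std::max(best, -eval(c));
    return best;
}

// Self-consistency metric: mean absolute gap between the 0-ply and 1-ply
// scores over a sample of positions. A perfectly self-consistent
// evaluator would score 0 here.
static double inconsistency(int nPositions) {
    double total = 0.0;
    for (int p = 1; p <= nPositions; ++p)
        total += std::fabs(eval(p) - eval1ply(p));
    return total / nPositions;
}
```

For two real engines one would run `inconsistency` over the same position set for each and compare the averages, lower being more self-consistent.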
> I suspect that this increased focus on the PV would result in a higher probability of a near optimal move, but maybe a lower probability of finding the optimal move.
This could also have to do with a somewhat different metric for evaluators, namely precision (or effective granularity). In the GBM paper, they put evaluations into 7 bins (white is won, white has a clear advantage, white has a slight advantage, equal, etc.). If you attempt a typical PV search in chess with only 7 possible results from your evaluator, the "hill" you need to climb in order to change the PV should become rather steep, which should lead to more stable PVs. If an engine were to demand that a non-PV move [anywhere in the tree] beat the corresponding PV move by more than the minimal increment, say 0.1 rather than 0.01, then the concern of granularity might not be purely academic. [I don't know if any engines do this to any real extent, though one can detect a similar notion in various types of pruning].
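The margin idea in the last bracket might look like the following. This is a hypothetical sketch; the function name and the margin value are made up and not taken from any actual engine.

```cpp
#include <cstddef>
#include <vector>

// Among scored moves, keep the earlier (PV) move unless a later move
// beats it by more than `margin`. With margin = 0 this is an ordinary
// argmax; a larger margin makes the PV "stickier", i.e. the hill a
// non-PV move must climb becomes steeper.
static std::size_t pickMove(const std::vector<int>& scores, int margin) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < scores.size(); ++i)
        if (scores[i] > scores[best] + margin) best = i;
    return best;
}
```

With scores in centipawns, `pickMove(scores, 10)` would demand a 0.1-pawn improvement before switching the PV, as described above.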
Of course, instead of babbling about it I should just try it with two free engines, but then again, talk is cheap :-)
/* Steinar */
I am not sure what happens in practice.
>Are search and evaluation as independent as people would sometimes have them to be?
Some developers try to separate the two - others would rather have them be more related. For instance, if you evaluate at each node, you can then take the info you obtain (such as king safety and/or static threats) and use it to decide whether to reduce, or whether/how to use null move, or whatever. As an example, Schröder gives cases of this in his description of Rebel (see Search Techniques in REBEL).
Historically, there was also the question of data structures (particularly bitboards), though this has more to do with the interplay between move generation and evaluation than search/evaluation.
Examples would be: can one easily derive desired eval info from (say) bitboards as opposed to a different board representation? Which things should be computed incrementally in move generation (typically PieceSquareValues and Material) rather than being considered in eval? In a few years, a similar [though affecting eval almost exclusively] question might come about: when a fast POPCNT instruction becomes mainstream, will this lead the authors of engines to give consideration to different eval terms that were previously thought to be too costly to compute? [Actually, other than some calculations with mobility, I can't see fast POPCNT being a panacea, but others might be more clever than I].
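For illustration, a popcount-based mobility term might look like the following. This is a hedged sketch: the `mobility` weighting scheme is hypothetical, and only the popcount itself corresponds to the POPCNT instruction discussed above.

```cpp
#include <cstdint>

// Population count over a 64-bit bitboard. With a hardware POPCNT this
// whole term compiles down to a handful of instructions.
static int popcount64(std::uint64_t b) {
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_popcountll(b);
#else
    int n = 0;
    while (b) { b &= b - 1; ++n; }  // Kernighan's trick as a portable fallback
    return n;
#endif
}

// Hypothetical mobility score: weight times the number of reachable
// squares, excluding squares occupied by our own pieces.
static int mobility(std::uint64_t attacks, std::uint64_t ownPieces, int weight) {
    return weight * popcount64(attacks & ~ownPieces);
}
```

Without fast POPCNT, each such term costs a loop or a table lookup per call, which is exactly why some mobility-like terms were historically considered too expensive.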
I was thinking about dependencies that could make the same (better) evaluation perform worse in another program.
In theory it is certainly possible; in practice, I do not know.
In practice, all evaluations are dependent on search.
What I meant is that I do not know whether we can say that evaluation A is better than evaluation B.
The only way to check is to change program A to use the evaluation of B, and do the same for B with the evaluation of A, at zero cost (we can make one program artificially slower, or play at unequal time controls, to emulate this situation).
What I do not know is whether we are going to find that one program has the better evaluation.
If we find that the evaluation of program B makes A better when we ignore the time factor, then B has a better evaluation than A.
If we find that the evaluation of program A makes B better, then A has a better evaluation than B.
If we find neither one nor the other, then it may be unclear which program has the better evaluation.
In Rybka, we tend to go through phases.
In late 2005, I went through a big eval phase. I think that this was the strongest point of Rybka 1.0 - although Larry seems to disagree with me about it :)
Throughout 2006, I basically just worked on search. This culminated with Rybka 2.2n2 and WinFinder.
When Larry joined me in 2007, we went back and worked on the eval again, under a completely different philosophy. The Rybka 2.3.X versions were a sort of early prototype of this method. I was quite happy with these steps and now they've been taken much further, to the point that Rybka is nearly unrecognizable.
>"Finally, Rybka is better than Crafty because Vas has implemented something new and interesting that I have not yet discovered. Nothing more, nothing less." ----Robert Hyatt
Minimally, I would append an additional, indeed quite important, codicil: "... that I have not yet discovered and implemented." :)
[Maybe RH is espousing the Socratic notion - knowledge is virtue (or: once we know virtue, we will be virtuous) - while I propose the Aristotelian modification].
In fact, I've never even profiled Rybka and spend zero time on optimization. This should be quite obvious from the sources - there are no unrolled loops or other arcane constructions, assembly sequences which don't map to C, etc.
It just doesn't seem like a productive area, especially long-term.
A few comments for any bitboard fanatics who might be browsing here:
1) I typically put elegance and simplicity before speed. Don't look for too much meaning at the low level. Someone like Gerd Isenberg could probably speed Rybka up by 10-15% without crossing over into any really hard-core optimization.
2) I've always just used plain Crafty-style rotated bitboards, and haven't yet managed to try anything else. My intuition is that the magic number approach (which wasn't around when I started) would be a little bit better. This may be doubly true for Rybka, as I suspect that she pollutes the cache more than most engines. If I started today, this is probably what I'd go with.
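For anyone curious, here is a self-contained sketch of the magic-number approach for a single rook square (d4), with the magic found by random trial. Real engines precompute all 64 squares (bishops too) at startup or store hand-found magics as constants; everything here is illustrative rather than anyone's actual implementation.

```cpp
#include <cstdint>
#include <vector>

using U64 = std::uint64_t;

// Rook occupancy mask: the rook's rays, excluding the board edges, since
// a piece on the edge square can never block anything behind it.
static U64 rookMask(int sq) {
    int r = sq / 8, f = sq % 8;
    U64 m = 0;
    for (int i = r + 1; i <= 6; ++i) m |= 1ULL << (i * 8 + f);
    for (int i = r - 1; i >= 1; --i) m |= 1ULL << (i * 8 + f);
    for (int i = f + 1; i <= 6; ++i) m |= 1ULL << (r * 8 + i);
    for (int i = f - 1; i >= 1; --i) m |= 1ULL << (r * 8 + i);
    return m;
}

// Slow reference generator: slide along each ray until a blocker is hit.
static U64 rookAttacks(int sq, U64 occ) {
    int r = sq / 8, f = sq % 8;
    U64 a = 0;
    for (int i = r + 1; i <= 7; ++i) { a |= 1ULL << (i*8+f); if (occ >> (i*8+f) & 1) break; }
    for (int i = r - 1; i >= 0; --i) { a |= 1ULL << (i*8+f); if (occ >> (i*8+f) & 1) break; }
    for (int i = f + 1; i <= 7; ++i) { a |= 1ULL << (r*8+i); if (occ >> (r*8+i) & 1) break; }
    for (int i = f - 1; i >= 0; --i) { a |= 1ULL << (r*8+i); if (occ >> (r*8+i) & 1) break; }
    return a;
}

// Expand the index'th subset of the set bits of mask (all occupancies).
static U64 subset(U64 mask, int index) {
    U64 occ = 0;
    int bit = 0;
    for (int s = 0; s < 64; ++s)
        if (mask >> s & 1) { if (index >> bit & 1) occ |= 1ULL << s; ++bit; }
    return occ;
}

// Deterministic xorshift64; ANDing three draws gives the kind of sparse
// numbers that tend to work as magics.
static U64 rngState = 0x9E3779B97F4A7C15ULL;
static U64 rand64() {
    rngState ^= rngState << 13; rngState ^= rngState >> 7; rngState ^= rngState << 17;
    return rngState;
}

// Find a magic for one square by random trial and fill the attack table.
// The fast lookup is then: table[((occ & mask) * magic) >> (64 - bits)].
static U64 findMagic(int sq, int bits, std::vector<U64>& table) {
    U64 mask = rookMask(sq);
    int n = 1 << bits;
    std::vector<U64> occs(n), atts(n);
    for (int i = 0; i < n; ++i) { occs[i] = subset(mask, i); atts[i] = rookAttacks(sq, occs[i]); }
    for (;;) {
        U64 magic = rand64() & rand64() & rand64();
        table.assign(n, 0);  // 0 is a safe sentinel: rook attack sets are never empty
        bool ok = true;
        for (int i = 0; i < n && ok; ++i) {
            int idx = int((occs[i] * magic) >> (64 - bits));
            if (table[idx] == 0) table[idx] = atts[i];
            else if (table[idx] != atts[i]) ok = false;  // destructive collision
        }
        if (ok) return magic;
    }
}
```

For d4 (square 27) the mask has 10 relevant bits, so the table has 1024 entries; the whole-board version replaces rotated-bitboard bookkeeping with one multiply and shift per lookup.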
Of course, the 10% that you may get from optimization is not important.
idiot. It's such a rare experience for me that I'm not even angry, just amazed. Somehow, without ever caring about execution speed, you have written an engine that searches more nodes per second than *any other commercial engine*, including Fritz, the engine written by the man you claim is an assembler god. A casual observer might be quite confused by this, but someone who read Strelka's source code somewhat less so, because it contains hundreds of examples of ugly code chosen because it runs damn fast. Some easy ones:
Strelka's search contains completely separate routines for PV and non-PV nodes, which requires writing most search code twice, but saves a few branches and cycles. This could be done by #including C files, but that is also ugly.
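The "two copies from one source" idea can also be had with a C++ template parameter instead of #included C files; the compiler then emits separate PV and non-PV code paths. The sketch below is a toy: the "game tree" is just a fixed-depth binary tree whose leaves are scored from a hypothetical array.

```cpp
#include <algorithm>
#include <vector>

// Leaf scores for the toy tree (indexed modulo its size); hypothetical data.
static std::vector<int> leafScores;

// One source, two instantiations: search<true> is the PV-node routine,
// search<false> the non-PV routine, as in the separate-routines scheme.
template <bool PvNode>
static int search(int node, int depth, int alpha, int beta) {
    if (depth == 0) return leafScores[node % (int)leafScores.size()];
    for (int child = 2 * node + 1; child <= 2 * node + 2; ++child) {
        int score;
        if (PvNode && child == 2 * node + 1)
            score = -search<true>(child, depth - 1, -beta, -alpha);   // first child stays on the PV
        else
            score = -search<false>(child, depth - 1, -beta, -alpha);  // siblings take the non-PV path
        if (score >= beta) return score;  // fail high
        alpha = std::max(alpha, score);
    }
    return alpha;
}
```

In a real engine the two instantiations would diverge much more (extensions, pruning, aspiration handling), which is the whole point of splitting them.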
Strelka's noncapture move generator scores each move from the history table, as opposed to having a separate loop, which is what everyone else does. This is faster, since there are probably free slots in the pipeline during move generation, but it is extremely ugly, since any change to history code must be done in 15 different places, once for each piece type.
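The difference between the two styles can be sketched like this; the move representation and history layout are hypothetical simplifications, with one merged function standing in for the 15 per-piece-type copies described above.

```cpp
#include <utility>
#include <vector>

// Minimal hypothetical move: from-square, to-square, ordering score.
struct Move { int from, to, score; };

// History table indexed by from/to squares (zero-initialized).
static int history[64][64];

// Merged style: the history score is attached as each move is emitted,
// one pass over the move list.
static void genScored(const std::vector<std::pair<int,int>>& raw,
                      std::vector<Move>& out) {
    for (auto [f, t] : raw) out.push_back({f, t, history[f][t]});
}

// Separate-loop style: generate first, then score in a second pass.
static void genPlain(const std::vector<std::pair<int,int>>& raw,
                     std::vector<Move>& out) {
    for (auto [f, t] : raw) out.push_back({f, t, 0});
}
static void scoreMoves(std::vector<Move>& moves) {
    for (Move& m : moves) m.score = history[m.from][m.to];
}
```

Both styles produce identical scored lists; the merged one trades maintainability for the pipeline-friendliness mentioned above.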
Strelka's static exchange evaluator doesn't compute the actual score, but only whether or not the capture is losing. Of course, it is lightning fast since there are millions of early-termination cases. But compare it to Crafty's swap() function and then talk to me about elegance and simplicity.
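The "is this capture losing?" idea can be sketched as follows. This is a simplification, not Strelka's or Crafty's code: attack detection is abstracted away, and the caller passes the further attackers of the target square as sorted lists of piece values, cheapest first; only the early exits and the yes/no answer reflect the scheme described above.

```cpp
#include <algorithm>
#include <vector>

// Swap-off simulation answering only "does this capture lose material?",
// with cheap early-termination cases handled before any real work.
// Piece values in pawns (P=1, N=B=3, R=5, Q=9).
static bool captureLoses(int victim, int attacker,
                         std::vector<int> ours,     // our further attackers, cheapest first
                         std::vector<int> theirs) { // defenders of the square, cheapest first
    if (theirs.empty()) return false;       // undefended: taking it can't lose
    if (victim >= attacker) return false;   // even a recapture leaves us no worse off
    // Build the swap list (classic SEE), sides alternating captures.
    std::vector<int> gain{victim};
    int onSquare = attacker;                // value of the piece now on the target square
    std::size_t oi = 0, ti = 0;
    for (bool theirTurn = true;; theirTurn = !theirTurn) {
        if (theirTurn) {
            if (ti >= theirs.size()) break;
            gain.push_back(onSquare - gain.back());
            onSquare = theirs[ti++];
        } else {
            if (oi >= ours.size()) break;
            gain.push_back(onSquare - gain.back());
            onSquare = ours[oi++];
        }
    }
    // Back up the swap list: either side may stop capturing when it prefers.
    for (std::size_t i = gain.size() - 1; i > 0; --i)
        gain[i - 1] = std::min(gain[i - 1], -gain[i]);
    return gain[0] < 0;
}
```

A full SEE would return `gain[0]` itself; returning only the sign is what allows the many early exits.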
I could go on all day, but I think I've made my point. Against this mountain of empirical evidence, we have your claim that "you don't program in assembly, and therefore you do no optimization". Either you are using a different set of definitions from the rest of the world, or you are so desperate to maintain the reputation of Rybka as a high-knowledge program that you first obfuscated its output and now are attempting to maintain that illusion despite the previously mentioned mountain of evidence to the contrary. Mr Rajlich, if that is your real name, let me remind you of an old quote: you can fool some people all the time, and some people all of the time, but you can't fool everyone all the time. The jig is up, and while your pitiful efforts to maintain the facade may convince people who don't read C++, those of us who do are only lowering our opinion of your integrity even further.
P.S. I have never understood why it's important that Rybka is a high
knowledge program anyway. If it works, who cares?
So I consider my reputation, for honesty at least, to be safe.
From a non-programmer point of view though there is just one thing I don't understand: If Rybka is a high-speed, low-knowledge program then what is Larry doing?? Surely he'd practically be redundant...
It is highly likely, in fact almost certain, that whatever is on Mr. Rajlich's computer now is very different.
> whatever is on Mr. Rajlich's computer now is very different.
This is one "human" reason which makes the VR story reasonable (he mentioned it in passing himself) - optimisation is something that is typically done at an endpoint [having run out of new ideas to try, for instance], not in the mid-stages of development. Of course, Rybka 1.0 might have been an "endpoint" just as much as a "starting point" from what we can tell.
High speed does not mean low knowledge, and I do not consider Strelka to be a low-knowledge program.
Strelka uses more than a hundred constants in the evaluation, even without the material table, and Larry explained that part of his job is to find better values for the numbers.
Note that Strelka's code has a lot of magic numbers, and I gave meaningful names to the variables in my private version of the code.
In eval.c, after modification:
1) I can easily count 7 constant names for pawn structure and 2 arrays of 8 values for candidate pawns in the part that goes into the pawn hash. That gives practically 7+2*6=19 constants for Larry to modify.
2) 7 arrays for passed-pawn scoring in the opening and the endgame give 42 values to tune.
3) 6 constants for king attack and 8 constants for mobility.
4) 11 constants for different tasks.
5) 16 variables for king attack weight.
6) The following 16 constants are part of Strelka's code and are used to calculate the piece-square tables:
const __int16 PawnFileOpening = 181;
const __int16 PawnFileEndgame = -97;
const __int16 KnightCentrOpening = 347;
const __int16 KnightCentrEndgame = 56;
const __int16 KnightRankOpening = 358;
const __int16 KnightTrapped = 3200;
const __int16 BishopCentrOpening = 147;
const __int16 BishopCentrEndgame = 49;
const __int16 BishopBackRankOpening = 251;
const __int16 BishopDiagonalOpening = 378;
const __int16 RookFileOpening = 104;
const __int16 QueenCentrOpening = 98;
const __int16 QueenCentrEndgame = 108;
const __int16 QueenBackRankOpening = 201;
const __int16 KingFileOpening = 469;
const __int16 KingCentrEndgame = 401;
There are more constants, and I have already counted more than 100 of them in this post.
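As a hypothetical illustration (this is not Strelka's actual formula) of how a few named constants like those above can expand into full piece-square tables, here a single centrality constant is scaled by each square's distance from the edge:

```cpp
#include <algorithm>
#include <cstdint>

// One constant taken from the list above; the expansion formula below is
// invented for illustration only.
const std::int16_t KnightCentrOpening = 347;

// Centrality of a square: 0 on the edge, 3 in the four center squares.
static int centrality(int sq) {
    int r = sq / 8, f = sq % 8;
    int rc = std::min(r, 7 - r), fc = std::min(f, 7 - f);
    return std::min(rc, fc);
}

// Expand the single tunable constant into a 64-entry table at startup.
static int knightPst[64];
static void buildKnightPst() {
    for (int sq = 0; sq < 64; ++sq)
        knightPst[sq] = KnightCentrOpening * centrality(sq) / 3;
}
```

The point is that one tuned number can control 64 table entries, which is why the count of constants understates the number of evaluation outputs but also why each constant is so consequential to tune.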
The job of finding the best values for the numbers is not an easy one, and I believe that nobody can find the best values when even Larry cannot do it; the best he can do is find better values than the existing ones.
Do not believe people who tell you about low-knowledge programs; all the top chess programs have a lot of knowledge in their evaluation.
Strelka may have less knowledge relative to other programs (I am not sure about it, because I did not see the code of Fritz, Junior or Hiarcs), but that does not mean that Strelka is a low-knowledge program.
If you are interested in low-knowledge programs, then I suggest using only 10 numbers in your program to calculate the evaluation (it is possible to do, and material evaluation alone uses only 5 numbers, for the values of the pieces).
Note that even in the hypothetical case of 10 numbers it is not trivial to find the best values, but this case does not happen in practice, and practically we have two types of programs:
1) high-knowledge programs
2) very-high-knowledge programs.
The evaluation of type-2 programs is not always better than that of type 1, because if, for example, you have 100,000 constants to optimize, then it is easier to get wrong values for the constants.
> Larry explained that part of his job is to find better values for the numbers.
I think another part of LK's job is to find new things to measure, and I would say that this is probably of more importance. Diddling with numbers can be done to some extent via (say) learning techniques [albeit with slow convergence in many examples], but finding new "positional features" (to borrow a word from neural nets) is more likely to require human input. It is also not clear to me that simply the number of constants (quantity as opposed to quality) is a useful metric for judging knowledge.
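A minimal sketch of what automated "diddling with numbers" might look like, under loud assumptions: toy features, toy target scores, and a simple fixed-step coordinate descent standing in for real tuning or learning methods.

```cpp
#include <cstddef>
#include <vector>

// Linear eval: weighted sum of a position's feature values (hypothetical).
static double predict(const std::vector<double>& w, const std::vector<double>& x) {
    double s = 0;
    for (std::size_t i = 0; i < w.size(); ++i) s += w[i] * x[i];
    return s;
}

// Total squared error of the current weights against target scores.
static double totalError(const std::vector<double>& w,
                         const std::vector<std::vector<double>>& xs,
                         const std::vector<double>& ys) {
    double e = 0;
    for (std::size_t i = 0; i < xs.size(); ++i) {
        double d = predict(w, xs[i]) - ys[i];
        e += d * d;
    }
    return e;
}

// Coordinate descent with a fixed step: nudge one weight at a time and
// keep only changes that strictly reduce the error.
static void tune(std::vector<double>& w,
                 const std::vector<std::vector<double>>& xs,
                 const std::vector<double>& ys,
                 double step, int rounds) {
    for (int r = 0; r < rounds; ++r)
        for (std::size_t i = 0; i < w.size(); ++i)
            for (double delta : {+step, -step}) {
                double before = totalError(w, xs, ys);
                w[i] += delta;
                if (totalError(w, xs, ys) >= before) w[i] -= delta;  // revert non-improvements
            }
}
```

Real tuning would need game outcomes or search-backed scores as the target and far better optimizers; as noted above, such methods adjust numbers, but someone still has to invent the features.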
From time to time you have made posts like this, saying that adding a particular term to the evaluation resulted in a big impact.
Is it possible, if you have time, to post a pair of positions in which this happened, with the evaluations by the new and the old Rybkas?
Or, if it is top secret right now, will you do it when the release date arrives? Something similar to this.
We'll do our best to describe everything in general terms before Rybka 3 is released.
Ok Larry, all is top secret, but please give us some hints:
1. What is a big boost for a new term, in your mind?
2. How do you verify the boost? By a number of positions prepared for the new term, by 36,000 games at 1" per game, or by other means?
3. How many terms do you have in Rybka now, approximately?
4. Did you ever delete terms in the past because they didn't work, or do you always trust that better values will be found in the future?
2. Mostly by something like your 36,000 1" games, but we also make sure that the gains are still there at slow speeds (like maybe 5" per game!). Of course it is possible that something may help at those speeds but not at game/5' or slower, but this rarely if ever seems to happen. Only the Elo gain may be less.
3. Somewhere in the ballpark of a thousand, though some of those are not currently treated as independent terms so one could also say just a few hundred.
4. Often Vas gives me a version with new terms I request or he proposes, but if I can't prove their value after several tries at setting values he will not include them in the real Rybka program. They usually remain in my versions for a long time in case I want to try again.
1. Wow!! This is really a boost! I'm not so convinced about the cutback (CCRL/CEGT) I have already seen earlier in your post. That's not only enthusiasm from my side. We will see ..
2. I think eval helps more in the long run. Not all eval terms can be covered by search!
3. Do you have a clue about the interdependences of your (independent) terms? I believe it's essential to keep terms from interfering with each other! If you aren't alert, you won't know what you are testing. Maybe less is more!
4. I only asked this because of point 3. If you have hundreds of terms, can you tell me which term is decisive for the eval of a rook on the 7th rank? Only one term, or three, or maybe 8? And do you know the interdependences?
Oh I see, it's a hard business!
Just one comment: I propose a better word than "interdependences"; I think it would be better to call it correlation.
If some terms are correlated among themselves, the engine could end up over- or under-estimating a position, because some weights are redundant across more than one term.
Just an opinion.
What we currently do is very primitive and very human-based. Larry manages by hand tons of interrelated and "dirty" chess terms, using his own chess intuitions, which were learned the hard way.
We do have some ideas for automating some of these processes and making them more scientific. This is for Rybka 4 and beyond. In theory, some general techniques could be developed which could then be applied to other games which are 'similar to chess'.
As usual, the future is a lot more exciting than the reality :)
But can you tell us which books you consider especially useful? For example, for endgames:
Dvoretsky's Endgame Manual and Müller and Lamprecht's Fundamental Chess Endings?
Anyhow, something like the Fritz 9 engine information page should not harm Rybka's secrets: positions from some historical games where the new Rybka evaluation is better. When you release her, it would make some nice publicity.
> But knowing which terms are important is useful for our competitors, because they would then spend time refining the corresponding term (if any) in their own programs.
And what's bad about that? You could get stronger opposition to show how good Rybka really is (and I think that would be good for computer chess in general).
Powered by mwForum 2.27.4 © 1999-2012 Markus Wichitill