Rybka Chess Community Forum
Topic: The Rybka Lounge / Computer Chess / The level of play in CCRL and FIDE rating systems
- - By deka (****) [ee] Date 2013-08-22 16:22 Edited 2013-08-22 16:24
I wrote a paper whose aim is to show the relationship between the CCRL and FIDE rating systems from the point of view of absolute playing strength. It also includes conversion tables.
http://www.chessanalysis.ee/CCRL%20vs%20FIDE.pdf

The two most important points concluded from this little study:

1. The relationships between the accuracy of play and rating run in completely opposite directions for the two types of players and their respective rating systems.
2. Modern desktop computers are not yet capable of consistently playing at a 3100-3200+ level.
Parent - - By rocket (***) [se] Date 2013-08-22 18:15
The built-in contempt setting of most engines far exceeds humans'. Engines in their default tuning are always set to play for the win, which could be one reason why the best engines are so highly rated. Whereas someone like Kasparov had to be contempt (hehe) with drawing against safe players.

However, I suspect the explanation lies elsewhere; that's just one major difference. The actual number of games where the contempt setting played a role is probably insignificant.
Parent - - By deka (****) [ee] Date 2013-08-22 21:27
I don't think this theory is watertight either. The problem with 'playing for the win' is that it doesn't guarantee better results. More aggressive play is usually accompanied by suboptimal moves, decreasing the overall quality of play.
Parent - By rocket (***) [se] Date 2013-08-22 22:19 Edited 2013-08-22 22:28
The point being that it works in favour of the superior engine nine times out of ten, which inflates its rating. This is a highly speculative theory.
Parent - - By mindbreaker (****) [us] Date 2013-08-25 05:08
Garbage. Sorry to be so blunt. "Rybka" (Rybka 3 Default, given in the previous paper) and 5 minutes on "5 simultaneous PV-s on a dual 2GHz Athlon" is supposed to be your measuring stick of perfection. That completely guts any chance of finding out anything. Only errors above one pawn should be considered errors at all, and even then it would still be very shaky by such measures. And your computer candidates, "Micro-Max 4.8, Waxman 2008, Philou 2.8.0, Crafty 23.0 and Hiarcs 12.1": ludicrous. Are any of those capable of reliably playing 3100-3200? Probably not; the strongest is Hiarcs 12.1 at 2913. Still, against a human, that Hiarcs would kill any top player with a quality opening book on modern hardware, mostly because positionally it is not terrible and tactically it is far beyond humans. The aggressive style would also knock most humans off balance.

Further, human ratings and computer ratings are not comparable because the styles are so different. The computers are about a thousand points higher in tactics and a hundred to three hundred points lower on positional aspects at the top. The positional weakness mostly stems from applying general positional ideas to all positions rather than varying them by opening and line, where values change somewhat; from the fact that combinations of material have different collective values than the sum of their individual values; from not knowing what material to hang on to; from trading at the first smidgen of gain where the engine could have continued to squeeze and get more; and from simply not knowing which endgames to trade into, often squandering winning chances. And there are a few other positional weaknesses: they can't see futility (impossibility to progress), inevitability (there is a way in, but deep), no way in, or fortress positions/locked pawn chains. There is quite a bit of overlap here. It is mostly that there are millions of possible lines but a handful of key positions, achievable or not, which decide the outcome. There could be millions of transpositions into them, or the engine could try everything and it would not matter...the city gates are guarded.

Differences in human outcomes against various engines are almost entirely a function of the positional strength, or lack thereof, of the engines; humans simply have to avoid tactics and sharp positions as best they can, because any engine above 2600, and many even lower, would kill them there. This means that the difference in tactical strength between one engine and another would not affect a rating list of engines if those engines only played humans, and the hierarchy of engines on that list would be very different from engine-vs-engine rating lists. This undermines any attempt to compare the FIDE or any other human list with CCRL or any other computer list.

Of additional concern is that humans vary in strength year to year, month to month, even day to day, and they play very few games. The margin of error in the human lists is probably massive as a result. You never know what you are going to get when humans show up to a tournament. And with so few games played, the bottom finisher could actually be playing the strongest and just have been profoundly unlucky.

There are other problems with the test as well. For example an engine can have personality...a bias against some moves that is irrational. Three or more moves may be of very nearly equal merit, yet an engine may score them quite differently. That does not hurt the engine in tournaments as long as it chooses one of the three, but if you judge another engine based on the scores of the other two options, you are transferring this bias.
Parent - - By deka (****) [ee] Date 2013-08-25 09:15 Edited 2013-08-25 09:19

>Garbage. Sorry to be so blunt.


Don't worry about that; people with little comprehension skills usually tend to be blunt. :smile: No surprises here, as evidenced by your post.

>"Rybka" (Rybka 3 Default, given in the previous paper) and  5 minutes on "5 simultaneous PV-s on a dual 2GHz Athlon" is supposed to be your measuring stick of perfection.


It was never meant to be perfect. What really matters is that the engine-and-hardware combination used for analyzing games is stronger than any of the players and engines analyzed. My hardware is far superior to what's used in CCRL games.

>That completely guts any chance of finding out anything.


Why not? Can you provide a constructive explanation?

>Only errors above one pawn should be considered errors at all


Not true; all errors should be counted in. What small errors lack in significance, they make up for by being far more frequent.

>Are any of those capable of reliably playing 3100-3200?


No, and I never implied so. The point of choosing these engines was to find a relationship between accuracy of play and CCRL ratings. If you had been more attentive, you surely would have noticed how closely the actual curve and the exponential trend curve fit together. It means we can quite reliably assume that Houdini or Komodo wouldn't be able to play consistently at a 3200+ FIDE level even on modern hardware.
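For anyone who wants to check that curve-fitting step, here is a minimal sketch in Python. The rating and error numbers are made-up placeholders for illustration, not the data from the paper.

# Minimal sketch: fit an exponential trend to (rating, average error) points.
# The numbers below are hypothetical placeholders, NOT the paper's data.
import numpy as np

ratings = np.array([1700.0, 2100.0, 2500.0, 2900.0])  # hypothetical CCRL ratings
avg_err = np.array([0.40, 0.25, 0.15, 0.09])          # hypothetical average errors

# An exponential model err = a * exp(b * rating) is linear in log space,
# so a least-squares line through log(err) recovers b and log(a).
b, log_a = np.polyfit(ratings, np.log(avg_err), 1)
a = np.exp(log_a)
print(f"err(rating) ~ {a:.3g} * exp({b:.3g} * rating)")

Extrapolating a fit like this to 3100-3200 is exactly what the conclusion about consistent 3200+ play rests on, so the quality of the fit matters.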

>Further, human ratings and computer ratings are not comparable because the styles are so different.


And this fact is precisely the reason I undertook such research, and it's what makes my paper interesting. The results indeed confirm that the two systems are fundamentally different.

>The computers are about a thousand points higher in tactics and a hundred to three hundred points lower on positional aspects at the top.


What's the basis for such claims? Moreover, it contradicts the findings in the study. If the relative importance of tactics and strategy were equal, then being 1000 points stronger in tactics and 300 weaker in positional chess would result in being 700 Elo stronger overall. This is obviously not true, and as a matter of fact, tactics is more important in chess.

>For example an engine can have personality...a bias against some moves that is irrational.


I believe the consistency and credibility of the analysis outweigh this. If I had used, say, Critter instead, the differences would have been quite insignificant and barely noticeable.

The rest of your post doesn't contain anything new and is irrelevant to the topic.
Parent - - By mindbreaker (****) [us] Date 2013-08-31 07:35
I am sure I could have presented this with a less negative first sentence, which has obviously blinded you to the merits of my arguments. I apologize for that.

"little comprehending skills" ad hominem.  I am not offended, and I did set a bad tone, and you are invested in your effort, but ad hominem is not useful in reasoning if you actually care about facts. The merit of and argument should be intrinsic rather than given arbitrary weight based on the presenter. 

And, no, I am not saying anything that should be considered "new", or at least nothing that would not be considered common sense to someone familiar with chess engines. The opposite! What I am saying is obvious.

"Never meant to be perfect" you are treating it as though it is.  Any deviation from the first choice incurs an error.  You add those up and apply a bit of math.

"1.Find out the actual average error. It is calculated by taking the eval of best move by Rybka and substracting(sic) the evaluation of the move made by a player from it."

When Rybka selects a move different than a weaker engine, that does not make that move any better. Chess is an incomplete-knowledge game; ultimately the engines are just guessing until they reach a position they can solve. Even then they may not see the quickest way, just a way.

A stronger engine does not make stronger moves than a weaker one on every move where it deviates. In fact, a weaker engine could make 10 out of 10 better moves in a row, but the 11th could undo it all. Your measurement would dock it for all 11.

The higher rating for Rybka from performances against other engines could be entirely a result of not making large errors and detecting most of the opponent's large errors.

All you are finding is how similar engines and people are to Rybka 3, in 5 minutes, 5 lines, on particular hardware with a particular amount of RAM, in one particular run (or whatever your methodology of getting Rybka's opinion is). You are not finding relative move strength.

By your own measures, if you ran Rybka itself a second time the same way it was run initially, it would score below itself, because it would not do the same thing every time. Also, if you ran an engine known to be better such as Houdini 3, it would score below Rybka by your methodology.

"No, and never implied so. The point for choosing these engines was to find a relationship between accuracy of play and CCRL ratings. If you had been more attentive you surely would have noticed how closely the actual curve and exponential trend curve fit together. It means we can quite reliably assume that Houdini or Komodo wouldn't be able to play consistently over 3200+ FIDE level even on modern hardware."

Your measure of accuracy was Rybka 3...that simply undermines everything...it is simple and obvious.  The difference in shape you got should have been a clue as your measuring stick of "accuracy" was also a computer and would have been subject to all the same effects you attributed to the other engines.

    >Only errors above one pawn should be considered errors at all

"Not true, all errors should be counted in, what small errors lack in significance, they make up for being far more frequent."

You are not following the argument...you have no grounds to say the disparities between first and second or third are anything other than engine bias if they are small. Large errors provide far better certainty that a mistake was actually made.

I predict that if you run Houdini 3 games and other strong engine games through your test, it will turn your graph into a U-shaped graph, with the best "accuracy", as you measure it, near Rybka 3's strength.

As for the hardware, I went with what was stated in the paper your paper linked to. You claimed only minor modifications. Not giving your differing procedure forces one to go with what is presented. Failure to present your actual procedure also invites considerable doubt, not to mention being a blatant violation of everything scientific. All scientific testing includes what is needed to repeat the tests, or it is politics, opinion, a pulpit, or fraud, not science. And just saying "Rybka" appears to hide that you might not be using the most up-to-date version, which, if you are following the old procedure, would be true...Rybka 3. Appearing to conceal that does not engender credibility.

"What's the basis for such claims? Moreover is contradicts findings in the study. If the relative importance on tactics and strategy were equal, then being 1000 stronger in tactics and 300 weaker in positional chess results in being 700 Elo stronger in all. This is obviously not true, and as a matter of fact, tactics is more important in chess."

1000-300=700; math may work that way, but chess Elo does not. We are often only as strong as our weakest link. If you know your opponent's weak areas, you can change your approach. Chess players dodge tactical situations because they know they can't survive sharp positions against engines. They have several ways of trying to do this, some with more success than others. If they can get a very positional game, they have a reasonable shot. Chess tends to be more tactical than positional in general, maybe 65%-35%, but each game has both in differing amounts. If humans thought they were playing other humans but it was a top-level engine instead, they would get obliterated (especially if their style was more tactical), and would also recognize their enemy as a machine very quickly, and just give up or get very angry.
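To put a number on why Elo gaps don't simply net out, here is the standard Elo expected-score formula in a minimal Python sketch (the function name is mine):

# Standard Elo expected score for a player who is d points stronger
# than the opponent.
def expected_score(d: float) -> float:
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

print(expected_score(700))  # ~0.98: a flat +700 predicts near-total domination
print(expected_score(100))  # ~0.64

A flat +700 would predict crushing scores in every game; whether the weaker side can steer each game toward its strong area is exactly what that single number ignores.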
Parent - By MarshallArts (***) [us] Date 2013-08-31 08:42
"1000-300=700; math may work that way, but chess Elo does not.  We are often only as strong as our weakest link.  If you know your opponent's weak areas, you can change your approach.  Chess players dodge tactical situations because they know they can't survive sharp positions against engines.  They have several ways of trying to do this, some have more success than others.  If they can get a very positional game, they have a reasonable shot.  Chess tends to be more tactical than positional in general maybe 65%-35% but each game has both in differing amounts.  If humans thought they were playing other humans, but it was a top level engine instead, they would get obliterated (especially if their style was more tactical), and would also recognize their enemy as a machine very quickly, and just give up or get very angry."

Excellent thoughts, mindbreaker. I remember how I'd often play a great game for 10-20 moves and then ruin it all by making a bad move that my strong opponent could exploit, and that would define my results. So we are only as strong as our weakest link!

Furthermore, a strong human with enough time to analyze, and allowed to consult chess engines so that he does not blunder, will actually defeat the strongest unaided programs out there (hint: correspondence chess, where computer assistance is allowed). This too supports your argument: without the human tactical errors, the engines have no inherent advantage over humans. Their planning capabilities and strategic reasoning, if we can call it that, are still limited. Neutralize their tactical superiority by consulting equally strong tacticians, and they're suddenly an inferior opponent.

In addition, as you suggested, humans can steer the game towards quieter positions, where tactics are less prevalent and less decisive. In fact, the commonly held notion that chess is "99% tactics" also comes into doubt; it is rather a misconception.

Parent - - By deka (****) [ee] Date 2013-08-31 09:09

>has obviously blinded you to the merits of my arguments


I wasn't blinded; it was easy to see how wrong you were and how little your understanding of the subject was.

>"little comprehending skills" ad hominem.


It wasn't meant to be 'ad hominem', despite your wish to see it in such a light. It was an obvious and honest reply to your attempts to belittle my work, although somewhat harsh.

>you are treating it as though it is


No, I'm not. Where did you get that idea? Rybka 3 on my hardware is much stronger than the other engines I analyzed; that's all there is to it. We cannot have a perfect engine, so we must be satisfied with this one.

>When Rybka selects a move different than a weaker engine, that does not make that move any better.


Stronger engines on stronger hardware tend to make better choices on average. I never expected it to give a better evaluation on every single move.

>The higher rating for Rybka from performances against other engines could be entirely a result of not making large errors and detecting most of the opponent's large errors.


As far as I know, what makes one engine better than another are its eval function and search algorithm.

>You are not finding relative move strength.


This is surprising. When one plays better chess, one can evaluate better and find better moves. Therefore we can use the stronger player's evaluations to estimate the moves of a weaker chessplayer.

>By your own measures, if you ran Rybka itself a second time the same way it was run initially, it would score below itself.


Yes, of course, multi-core engines are non-deterministic. But the difference would be insignificantly small; the average error of a particular player could be 0.123 instead of 0.120, for example. The differences are very small and hence have no effect on the credibility of the analysis.

>Also, if you ran an engine known to be better such as Houdini 3, it would score below Rybka by your methodology. 


As I stated earlier, I'd never use a weaker engine to analyze stronger engines.

>Your measure of accuracy was Rybka 3...that simply undermines everything...it is simple and obvious.


How does that undermine everything, and how is it simple and obvious?

>The difference in shape you got should have been a clue as your measuring stick of "accuracy" was also a computer and would have been subject to all the same effects you attributed to the other engines.


What same effects? On what basis should it have been a clue?

>You are not following the argument...you have no grounds to say the disparities between first and second or third are anything other than engine bias if they are small.


If that were true, there'd be no correlation between accuracy of play and strength of play. The biggest source of inaccurate play is not large errors, but errors of around 0.20.

>Not giving your differing procedure forces one to go with what is presented. Failure to present your actual procedure also invites considerable doubt, not to mention being a blatant violation of everything scientific.


The two differences are:
1) the maximum possible error is increased from 2.00 to 4.00;
2) moves that are too obvious, i.e. that satisfy all of the following conditions simultaneously, are excluded:
a) the difference between the 2 best moves is larger than 1.00 across all depths,
b) the best move remains the same across all depths,
c) the player has selected the best move.
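Putting both modifications together, here is a minimal sketch of the error computation in Python. The names and data layout are mine; evaluations are assumed to be in pawns from the analyzing engine.

MAX_ERROR = 4.00    # cap raised from 2.00 to 4.00
OBVIOUS_GAP = 1.00  # gap that marks a move as 'too obvious'

def is_obvious(best_moves, gaps, played):
    # Exclude a position when, across all analyzed depths, the best move
    # stays the same, it leads the second-best move by more than 1.00,
    # and the player actually chose it.
    same_best = len(set(best_moves)) == 1
    big_gap = all(g > OBVIOUS_GAP for g in gaps)
    return same_best and big_gap and played == best_moves[-1]

def average_error(positions):
    # positions: dicts with per-depth 'best_moves' and 'gaps' (best minus
    # second-best eval), plus 'played', 'best_eval' and 'played_eval'
    # taken from the final depth.
    errors = []
    for p in positions:
        if is_obvious(p["best_moves"], p["gaps"], p["played"]):
            continue
        err = p["best_eval"] - p["played_eval"]
        errors.append(min(max(err, 0.0), MAX_ERROR))
    return sum(errors) / len(errors) if errors else 0.0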

>All scientific testing includes what is needed to repeat the tests, or it is politics, opinion, a pulpit, or fraud, not science.


Everything needed to reproduce the tests is described in my study.

>And just saying "Rybka" appears to hide that you might not be using the most up-to-date version, which, if you are following the old procedure, would be true...Rybka 3.


Or I just forgot to write '3'. I started data collection and analysis in 2009, before Rybka 4 came out.

>1000-300=700; math may work that way, but chess Elo does not.  We are often only as strong as our weakest link.  If you know your opponent's weak areas, you can change your approach.


What you wrote is obviously correct, but it's not easy to predict and take into account hypothetical changes of approach, be it humans using anti-computer strategy or Morphy playing stronger had he faced players equal to him.
Parent - - By Nelson Hernandez (Gold) [us] Date 2013-08-31 23:35 Edited 2013-08-31 23:37
For what it's worth, I thought your paper was a worthy attempt to compare two disparate rating systems. If perfection were the standard for informal papers such as this, then nobody would ever learn anything from analysis that is methodologically imperfect but directionally correct.

Mindbreaker's critiques, however harshly delivered, are in most cases valid, but at the same time they miss the point of your paper.

The question is: how would (let's say) Carlsen do against Houdini 3 on a strong machine? I think we all intuitively agree that he would have a very hard time winning games unless he successfully implemented some extreme anti-computer strategy--which he might well do. But what is much less clear is whether he could draw a fair number of games at will. Personally, I think he could, especially with the white pieces. Further clouding the picture is what kind of opening book Houdini might employ. A pedestrian one might easily get caught up in drawish book exits, which would allow a clever human to rack up half-points by exchanging into an inactive endgame or by deliberately pursuing move repetitions. Conversely, a book designed to exit into complicated positions (to make this conceptually simple, let's say positions with many legal and viable moves but no obvious ones) would undoubtedly tend to help the machine.

I think the key element in a human's favor is resourcefulness. A human understands a computer's strengths and limitations, while a computer has no such situational awareness. Thus, if the goal were to achieve the highest possible score in a match, a top GM might well make no attempt to win any game and pursue a draw every time, whether through an ugly anti-computer pawn blockade or through a conventional yet drawish opening featuring solid pawn structures. I think it is not inconceivable that Carlsen, well prepared and trying his best to draw every game, could score as much as 30% against Houdini. Which, I think, supports your general conclusion.
Parent - - By deka (****) [ee] Date 2013-09-01 08:14
Mindbreaker's critique mostly revolves around whether engines are usable for determining the accuracy of play. But it has already been shown by many authors that this is a perfectly valid method.

>The question is: how would (let's say) Carlsen do against Houdini 3 on a strong machine?


It's impossible to answer at this point, because one would have to factor in how much strength humans gain and engines lose when the former employ anti-computer strategy. I generally agree with you: humans have the intelligence to adapt their play to conditions, while engines dumbly play in the same style regardless of their opponent.

Carlsen's 2862 corresponds to ca 2700 CCRL, but it must be stressed that CCRL uses ca 3x shorter time controls and 8-year-old hardware.
Parent - - By Graham Banks (*****) [nz] Date 2013-09-02 20:05

>........ it must be stressed that CCRL uses ca 3x shorter time controls and 8-year-old hardware.


We benchmark to that particular hardware.
The hardware that we actually use varies. We don't all use the same hardware.
Parent - By deka (****) [ee] Date 2013-09-02 22:07
I see, thanks.
Parent - - By Uri Blass (*****) [il] Date 2013-08-31 13:46
I see no way to measure tactical rating and positional rating separately, so the following sentence means nothing:

"The computers are about a thousand points higher in tactics and a hundred to three hundred points lower on positional aspects at the top.  "
Parent - By MarshallArts (***) [us] Date 2013-08-31 22:22 Edited 2013-08-31 22:25
It's true that it is hard to measure exactly, or even approximate, tactical and positional ratings apart from each other, but the essence of what he's saying in this regard is correct: the engines have begun to close the positional gap somewhat, but they still get by primarily on tactics.
Parent - - By mindbreaker (****) [us] Date 2013-09-02 18:57
How about this for the tactics: take all the problems from the latest Chess Solving Championship http://www.saunalahti.fi/~stniekat/pccc/solving.htm and see how fast Houdini 3 solves them. We know absolute strength drops by about 80 Elo each time the time permitted is halved, so that can be used to adjust the time. The best human time was 334 minutes: http://www.saunalahti.fi/~stniekat/pccc/wcsc12.htm If computers are 1000 Elo better, then it should solve a similar number of problems in .06 minutes (about 3.5 seconds). As these were human experts at solving, and knew they were solving, it would only be fair to use Houdini 3 Tactical or something.

Maybe it is 1000 Elo or maybe it isn't.  But that is the outline for a fair test.  Not quite fair to the computers, but it should be interesting.
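For the arithmetic, here is that 80-Elo-per-doubling rule in a minimal Python sketch (the function name is mine; 80 is the rule of thumb stated above, not a measured constant):

ELO_PER_DOUBLING = 80.0

def equivalent_time(human_minutes: float, elo_gap: float) -> float:
    # Each 80 Elo of superiority is assumed to buy one halving of time.
    halvings = elo_gap / ELO_PER_DOUBLING
    return human_minutes / (2.0 ** halvings)

print(equivalent_time(334, 1000))  # ~0.058 minutes, i.e. about 3.5 seconds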
Parent - By mindbreaker (****) [us] Date 2013-09-02 19:24
Hmm, they have some helpmates and selfmates. I guess I will have to leave those out. Round 1: .01 seconds on an antiquated computer. Humans were given 2 hours.
Parent - By mindbreaker (****) [us] Date 2013-09-02 19:42
Round 2 complete: .01 seconds. Total .02 seconds.  That gives it about 1120 Elo better than the humans.
Parent - - By mindbreaker (****) [us] Date 2013-09-02 19:45
I think I must have misunderstood the time for the humans, as 334 minutes is longer than 4 hours.
Parent - By mindbreaker (****) [us] Date 2013-09-02 19:56
I used the wrong problems; that was another championship. It probably does not matter. It really only needed to take less than .48 minutes (about 28 seconds), as my computer, a Q6600, is about 8 times slower than a current strong desktop-sized server, and it only took .02 seconds; the humans made errors anyway. 1,000 Elo better is probably an underestimate.
Parent - By mindbreaker (****) [us] Date 2013-09-02 20:18
I wish I could have made a more precise measurement/test. I can't find any solving championship that gives the time used for each problem rather than just the total time. I don't know how much time was used for the helpmates and selfmates relative to the normal problems, and I don't think Houdini 3 is configured to solve those.
Parent - By mindbreaker (****) [us] Date 2013-09-02 21:03
I used the Fritz GUI and just used the new-position setup to enter the positions: Houdini 3 Tactical with a 16 MB hash table. The problem sets are given at http://www.saunalahti.fi/~stniekat/pccc/isc12p.htm Problems 3 and 9 took .01 seconds each; the others were unmeasurably fast. The last 2 problems of each section are not exactly chess. All the others were solved in a total of .02 seconds. You could argue those fractions of a hundredth of a second add up, but most had several plies after them which were also still at .00 seconds.

1. 0.00 D 11/19
2. 0.00 D 15/22
3. 0.01 D 14/45
4. 0.00 D 9/19

7. 0.00 D 11/16
8. 0.00 D 8/27
9. 0.01 D 20/37
10. 0.00 D 6/20 !
- By deka (****) [ee] Date 2013-09-02 22:32
Let's talk about the reasons why calculation-based play results in an exponential relationship between the accuracy of play and the strength of play, while knowledge-based play has the inverse, i.e. logarithmic, relationship. Anyone have any ideas, thoughts, theories?

Suppose we constructed a kind of artificial, hypothetical chess-playing entity whose play involves equal amounts of calculation and knowledge. Would it have a linear relationship between the level of play and the accuracy of play? :smile:
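To make the question concrete, here is a purely illustrative Python sketch; the functional forms and constants are arbitrary and not fitted to anything, they only show that an equal blend of an exponential and a logarithmic curve is itself neither of the two, nor linear.

import numpy as np

ratings = np.linspace(1600, 3000, 8)

# Arbitrary illustrative shapes, NOT fitted to any data:
acc_calc = np.exp((ratings - 1600.0) / 400.0)    # accuracy growing exponentially
acc_know = np.log1p((ratings - 1600.0) / 400.0)  # accuracy growing logarithmically
acc_mix = 0.5 * acc_calc + 0.5 * acc_know        # the hypothetical 50/50 entity

for r, m in zip(ratings, acc_mix):
    print(f"{r:6.0f}  {m:8.3f}")
# At the top end the exponential term dominates the blend, so an equal
# mix of calculation and knowledge would not by itself give a linear
# relationship.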