Topic Rybka Support & Discussion / Rybka Discussion / Parameters Experiment 38: 64 Elo over R4 default
My latest batch of experiments is a little over 400 games in per engine. One is showing very good results: it has won 3 of 4 gauntlets and finished a close second in the fourth.
They only play against non-R4s, just as ratings are determined in most ratings charts.
I am getting 3357 Elo based on 406 games, whereas Rybka 4 is at 3293 Elo based on 1986 games.
It is of course very preliminary; I expect to run the engines 900-1200 games each.
So far I have not seen any indication that I am running out of room to improve the parameters.
The current batch of 9 has completed 100-game matches with Stockfish 1.8, Deep Fritz 11, Critter 0.8, and an unnameable opponent. The current opponent is Zappa Mexico II, in a gauntlet which has just gotten under way.
I plan to give an update at 800 games in, and at the end, at 1200 games per engine (or fewer if I end it sooner), I plan to post the parameters and final results.
Looking forward to it, you could be finding new Rybka 4.1 defaults!
I don't think it will become the default, though it might be good as an included personality. I suspect R4 was tuned against other variants of itself and is probably optimized to get the best results against other variants of R4. If that is the case, my variant could easily lose to it head to head.
What it should be good at is tournaments where there will only be one Rybka 4, or ratings charts.
I may try for a match version as well that will be very good at not losing, but at the cost of more draws. But I suspect 4.1 will arrive before I get to it.
> But I suspect 4.1 will arrive before I get to it.
Or not, Vas is still nowhere to be found (eh, virtually-virtually, in the real-virtual world I could send him an email and he'd answer, but I don't want to bug him).
You don't want to bug him.

> But I suspect 4.1 will arrive before I get to it.
You believe everything?
I believe it only when I see it.
Yes, I have noticed this myself in (C) tournaments: you have multiple personalities that all score slightly better or worse than default, then one particular personality does well against those variants but worse against default itself head-to-head. Very annoying... and back to the drawing board :)
Probably I'm wrong (because deep inside I simply can't believe you get 60+ Elo over default), but I have the feeling you tested too many variants...
Let's suppose you wanted to test all the engines in the world (obviously I'm exaggerating) by playing, let's say, a 32-game match against Rybka.
After testing thousands of engines I wouldn't be surprised to see somewhere in the huge table something like Booot 4.15 - Rybka 4: 19.0 - 13.0.
How wrong am I? Could you test the best variant further?
You need to go to the last page (Tournament of the Champions). Indeed, the best of the variants appear closer to 30 Elo stronger rather than 60. It is, however, ridiculous to compare my tests to 32-game matches with one opponent: most of the variants play over 800 games total against 10 or more opponents. As for how the tests pointed at 60+ Elo gains... that has not been resolved, but it probably has something to do with the rating formula functions, or Rybka 4 default just having very bad luck initially.
Looking forward to the results & parameters.
Up to now I have not seen any real improvement by changing the parameters.
regards
> Up to now I have not seen any real improvement by changing the parameters.
Neither have I.
You can also see here:
http://www.computerchess.org.uk/ccrl/4040/rating_list_all.html that the claim that Rybka 4 64-bit TC3100150 4CPU was stronger than default was just an illusion.
Regards,
Gaмßito.
Well, I would not characterize it as an illusion exactly. It is just that it only outperforms at short time controls: http://www.computerchess.org.uk/ccrl/404/rating_list_all.html
I found those timings effective as well, but I have not tried to refine them, though they are included. All my efforts have been directed at piece valuation beyond the initial investigations of parameters given in the forums. I am hoping by doing so, that the improvements are general. Still it is conceivable that the parameters I am working on are only or most effective at faster time controls.
Ah, I just remembered: I did investigate the rook endgame scaling thing. That got me nowhere. 100 is optimal; any deviation is destructive, according to my findings.
"It is just that it only outperforms at short time controls:"
LOL, so at my 30 days/move controls, the defaults are better.
I guess you just need 1000 computers and a couple of years to find out.
Sounds interesting. I will stay tuned
Wayne
As promised, here is the update after 800 games each. I plan at least 300 more per engine. The current competitors are Exp 30-Exp 38.
I suppose it is not that surprising that Exp 38 could not continue that performance, however it did manage a narrow overall victory at this stage.
For a second I thought it had surrendered its first place; the Fritz software actually mis-sorted! That's a new one.
Each generation/cycle has seen a ratings increase over the previous generation. It looks like this time it will not be a big jump; 24 was the leader of the last generation, 18 before that, and 8 and Forum before that. The current gain over the default settings is 44 Elo. But as I have already stated, these settings are not configured for direct head-to-head matches with other R4 configurations, only against other top engines, say the first 10-15 places on most ratings lists.
Here is the new list:
# Engine Rating Games
1 Rybka 4 x64 Exp 38 v1 3337 806
2 Rybka 4 x64 Exp 31 v1 3336 810
3 Rybka 4 x64 Exp 24 v2 3332 1331
4 Rybka 4 x64 Exp 37 v1 3331 807
5 Rybka 4 x64 Exp 30 v1 3330 810
6 Rybka 4 x64 Exp 26 v2 3329 1329
7 Rybka 4 x64 Exp 34 v1 3329 810
8 Rybka 4 x64 Exp 28 v1 3325 1327
9 Rybka 4 x64 Exp 25 v2 3324 1330
10 Rybka 4 x64 Exp 18 v1 3321 3039
11 Rybka 4 x64 Exp 11 v1 3320 453
12 Rybka 4 x64 Exp 27 v1 3319 1328
13 Rybka 4 x64 Exp 32 v1 3318 810
14 Rybka 4 x64 Exp 36 v1 3318 807
15 Rybka 4 x64 Exp 33 v1 3318 810
16 Rybka 4 x64 Exp 22 v1 3317 2231
17 Rybka 4 x64 Exp 35 v1 3317 809
18 Rybka 4 x64 Exp 29 v1 3312 1327
19 Rybka 4 x64 Exp 23 v1 3311 2178
20 Rybka 4 x64 Exp 8 v1 3309 1310
21 Rybka 4 x64 Exp 9 v1 3306 429
22 Rybka 4 x64 Exp 14 Human 3304 898
23 Rybka 4 x64 Exp 7 v1 3301 815
24 Rybka 4 x64 Exp 4 v1 3298 898
25 Deep Rybka 4 x64 v1 3298 161
26 Rybka 4 x64 Forum v1 3296 1159
27 Rybka 4 x64 Exp 1 v1 3296 916
28 Rybka 4 x64 Exp 20 v1 3297 590
29 Rybka 4 x64 Exp 15 v1 3296 1799
30 Rybka 4 x64 Exp 3 v2 3296 422
31 Rybka 4 x64 Exp 16 v1 3294 555
32 Deep Rybka 4 x64 3293 1986
33 Rybka 4 x64 Exp 21 v1 3293 900
34 Rybka 4 x64 Exp 10 v1 3289 459
35 Rybka 4 x64 Exp 19 v1 3289 189
36 Deep Rybka 4 x64 Lasker 3285 477
37 Rybka 4 x64 Exp 13 vC13510 3283 749
38 Rybka 4 x64 Exp 2 v1 3282 471
39 Rybka 4 x64 Beta 15 v1 3282 426
40 Rybka 4 x64 Exp 12 v1 3282 896
41 Rybka 3 3279 1897
42 Rybka 3 Dynamic 3279 892
43 Rybka 3 Human 3277 1461
44 Rybka 4 x64 Exp 17 v1 3270 354
45 Rybka 4 x64 Exp 6 v1 3257 121
46 Deep Rybka 4 x64 Human 3255 174
47 Stockfish 1.8 JA 64bit 3245 3118
48 Stockfish 1.7.1 JA 64bit 4t 3233 4225
49 Stockfish 1.7 JA 64bit 4t 3222 560
50 Stockfish 1.6.3 JA 64bit 3205 384
51 Critter 0.80 64-bit 3172 1800
52 Deep Fritz 11 3115 4772
53 Naum 4 4t 3112 3414
54 HIARCS 12 MP 3105 3283
55 Deep Shredder 12 3101 1824
56 spark-0.4 3093 1802
57 Critter 0.70 64-bit 3075 2324
58 Komodo64 1.1 JA 3075 2333
59 Zappa Mexico II 3068 2519
60 Protector 1.3.3 x64 4t 3067 385
61 Deep Shredder 11 UCI 3061 2410
62 Komodo64 1.0 JA 3051 407
63 bright-0.5c 3032 2350
64 bright-0.4a 3014 21
65 Protector 1.3.6 x64 4t 2968 900
66 Thinker54AInert-MP64-UCI 2959 900
67 spark-0.3a 2933 23
68 Deep Fritz 12 2651 385
Sorry about the alignment; it just does that when I copy and paste from Excel.

Thanks for the report, what are the settings of Exp 38?
"And at the end at 1200 games per engine or less if I end it sooner, I plan to post the parameters and final results." ;)
It would be interesting to carefully look at the statistics to see if any of these are really better than the default (in a statistically significant sense), or whether as Dagh suggested, this is just a matter of having a large number of similar strength settings, with some unsurprisingly faring better than others due to statistical noise.
There's still the question of why new experimental parameters are consistently doing better than old ones (i.e. he always finds a new one that performs better than all the old ones).
First, I'm not sure if he is changing time allocation values. As a person who is only interested in analysis strength, I have zero interest in this aspect of tuning. Second, it's very easy, and natural, to come up with great results using a mixture of a large number of candidates and survival bias (just get rid of the laggards after 1000 games or so and you will end up with miraculous results, even if all the engines are identical in every way).
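The survivorship-bias mechanism is easy to demonstrate with a quick simulation; a sketch, assuming 40 truly identical "engines" (each with a true 50% expected score) of which only the luckiest is reported:

```python
import random

def simulate_selection(num_engines=40, games=1000, seed=1):
    """Simulate identical engines (true expected score 50%) and report the
    best observed score -- the 'winner' looks strong purely by luck."""
    rng = random.Random(seed)
    scores = []
    for _ in range(num_engines):
        # Each game is a win, draw, or loss with equal probability,
        # so every engine's true expected score is exactly 50%.
        points = sum(rng.choice((0.0, 0.5, 1.0)) for _ in range(games))
        scores.append(points / games)
    return max(scores)

best = simulate_selection()
print(f"best of 40 identical engines scored {best:.1%}")
```

The best of the batch will almost always score noticeably above 50%, even though nothing distinguishes it from the others, which is exactly the "miraculous results" effect described above.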
Anyway, as I stated below, I'm very willing to be proven wrong (but also rather skeptical).
If it were statistical noise, then each cycle would not perform better than the one before as a group. If you will notice, the latest 9 engines finished no lower than 17th place, and even the lowest one is 24 Elo higher than the default settings. If you took the group to be the same engine, then that engine would have played 7269 games and earned an Elo of 3326: hardly statistical noise, and an improvement of 33 Elo.
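For reference, the standard way to turn a pooled score percentage into a performance rating is to invert the logistic Elo expectancy. This is a minimal sketch of that textbook formula, not the exact formula the Fritz GUI uses (which isn't given in the thread):

```python
import math

def performance_elo(score_pct, opponent_elo):
    """Logistic-Elo performance rating: the rating at which the expected
    score against `opponent_elo` equals the observed score fraction."""
    # Expected score E = 1 / (1 + 10^((opp - own)/400)); solve for `own`.
    return opponent_elo - 400 * math.log10(1 / score_pct - 1)

# e.g. scoring 55% against a 3300-rated field works out to roughly +35 Elo
print(round(performance_elo(0.55, 3300)))
```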
Actually, what I should have said is statistical noise combined with survivor bias. I should further point out that my comments apply only to non-time allocation changes. Everyone agrees that the time allocation for engine games in R4 is God awful, but this has zero relevance to the large majority of people using the engine for analysis purposes rather than engine-engine games.
I'll admit that I'm skeptical that you have found changes in piece values that measurably improve strength, but I would be happy to be proven wrong. Perhaps after you release the settings, the piece value settings can be tested against a gamut of other engines by a third-party tester and the results can be compared to the default piece value settings.
I predict that mindbreaker's improvements will hold against non-Rybka engines.
I hope that there are improvements due to non-time-management parameter changes. But we shall see...
In the first generation I tried a few timings that were given on the site, but pretty much since then (second to fifth generation, each with roughly 9 engines per generation) I have not messed with the timings. The last 38 engines have been 3 100 150, except a handful of engines which used default timings instead. As timings have been essentially constant, the piece values should be what is stratifying their performances.
Then default settings with only 3 100 150 changed should appear as a reference (so timings can be ignored).
It is just statistical noise, but check out what Exp 30 is doing to Komodo ;)

Here is the current ratings table (without the vs Komodo round as it is not complete). I will probably run one more cannon fodder engine after that. As you can see, we have a new leader!
1 Rybka 4 x64 Exp 31 v1 3346 1053
2 Rybka 4 x64 Exp 38 v1 3341 1050
3 Rybka 4 x64 Exp 30 v1 3339 1053
4 Rybka 4 x64 Exp 37 v1 3338 1051
5 Rybka 4 x64 Exp 24 v2 3335 1331
6 Rybka 4 x64 Exp 34 v1 3333 1053
7 Rybka 4 x64 Exp 36 v1 3333 1051
8 Rybka 4 x64 Exp 26 v2 3332 1329
9 Rybka 4 x64 Exp 32 v1 3330 1053
10 Rybka 4 x64 Exp 28 v1 3329 1327
11 Rybka 4 x64 Exp 25 v2 3328 1330
12 Rybka 4 x64 Exp 33 v1 3327 1053
13 Rybka 4 x64 Exp 11 v1 3326 454
14 Rybka 4 x64 Exp 35 v1 3324 1052
15 Rybka 4 x64 Exp 18 v1 3323 3039
16 Rybka 4 x64 Exp 27 v1 3322 1328
17 Rybka 4 x64 Exp 22 v1 3319 2231
18 Rybka 4 x64 Exp 29 v1 3316 1327
19 Rybka 4 x64 Exp 23 v1 3314 2178
20 Rybka 4 x64 Exp 8 v1 3311 1315
21 Rybka 4 x64 Exp 9 v1 3312 434
22 Rybka 4 x64 Exp 14 Human 3305 900
23 Rybka 4 x64 Exp 7 v1 3303 815
24 Rybka 4 x64 Forum v1 3299 1160
25 Rybka 4 x64 Exp 4 v1 3299 900
26 Rybka 4 x64 Exp 20 v1 3298 590
27 Deep Rybka 4 x64 v1 3299 161
28 Rybka 4 x64 Exp 1 v1 3297 916
29 Rybka 4 x64 Exp 15 v1 3297 1800
30 Rybka 4 x64 Exp 3 v2 3297 423
31 Rybka 4 x64 Exp 21 v1 3294 900
32 Rybka 4 x64 Exp 16 v1 3294 557
33 Rybka 4 x64 Exp 10 v1 3295 462
34 Deep Rybka 4 x64 3294 1986
35 Rybka 4 x64 Exp 19 v1 3290 189
36 Deep Rybka 4 x64 Lasker 3286 479
37 Rybka 4 x64 Exp 13 vC13510 3284 750
38 Rybka 4 x64 Exp 2 v1 3283 472
39 Rybka 4 x64 Beta 15 v1 3283 426
40 Rybka 4 x64 Exp 12 v1 3282 900
41 Rybka 3 3279 1897
42 Rybka 3 Dynamic 3279 892
43 Rybka 3 Human 3276 1461
44 Rybka 4 x64 Exp 17 v1 3270 355
45 Rybka 4 x64 Exp 6 v1 3258 121
46 Deep Rybka 4 x64 Human 3256 174
47 Stockfish 1.8 JA 64bit 3251 3131
48 Stockfish 1.7.1 JA 64bit 4t 3234 4235
49 ****** 3221 24
50 Stockfish 1.7 JA 64bit 4t 3220 560
51 Stockfish 1.6.3 JA 64bit 3202 384
52 Critter 0.80 64-bit 3178 1800
53 Deep Fritz 11 3118 4774
54 HIARCS 12 MP 3115 4183
55 Deep Shredder 12 3107 1824
56 Naum 4 4t 3103 4311
57 spark-0.4 3099 1802
58 Komodo64 1.1 JA 3077 2338
59 Critter 0.70 64-bit 3077 2326
60 Zappa Mexico II 3072 2519
61 Deep Shredder 11 UCI 3062 2410
62 Komodo64 1.0 JA 3049 407
63 bright-0.5c 3033 2350
64 Protector 1.3.3 x64 4t 3027 782
65 bright-0.4a 3016 21
66 Protector 1.3.6 x64 4t 2972 900
67 Thinker54AInert-MP64-UCI 2962 900
68 spark-0.3a 2931 23
69 Deep Fritz 12 2654 385
Oh, ignore the v1/v2 stuff; it is meaningless.
Sorry, I guess one clone slipped through.
That is to say?
So after the Komodo test you will again have a new leader, Exp 30, because at the moment Exp 38 & 31 are getting the lowest results! Right?
JP.
Even with about 85 rounds complete, we really don't know which is best against Komodo. We would need maybe 3000 rounds or more for that. I am just getting games from several opponents, which when collected together tell me the top handful of engines.
For Elo testing purposes, engines that are nearly equal give the most valuable rating information. Spending a large amount of time on matches against Komodo, where Rybka's win percentage is ~90%, is not time well spent.
I am running out of opponents. Komodo is the 9th best engine: http://www.computerchess.org.uk/ccrl/404.live/
There is one engine not on the table because it never won or drew a game, so a rating could not be calculated. I ended that after 126 games. It is rated 2950 at CCRL. If you make them strong, they have this tendency of winning ;)
I am trying to use the strongest opponents I can find. I have run all the stronger opponents I have except Rybka 3, Rybka 3 Dynamic, and Rybka 3 Human. Maybe I will run those, but I am not a fan of running different versions of the same program against one another.
And who is to say that strength is only determined by close pairings? Should not ratings be legitimate even with some distance between opponents provided there are some draws and losses by the stronger side? According to critiques of the current ratings formulas it is actually the stronger side's rating that is underestimated by Elo tables.
A straight line is better than the Elo curve: http://www.chessbase.com/newsdetail.asp?newsid=562
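The two models under discussion can be put side by side. A sketch, where the logistic curve is the classical Elo expectancy and the linear slope of 1/850 is purely illustrative (chosen so the line roughly tracks the curve near equality), not a figure taken from Sonas's article:

```python
def elo_expected(diff):
    """Classical logistic Elo expectancy for a rating advantage `diff`."""
    return 1 / (1 + 10 ** (-diff / 400))

def linear_expected(diff, slope=1 / 850):
    """Straight-line expectancy, clipped to the valid [0, 1] range."""
    return min(1.0, max(0.0, 0.5 + diff * slope))

# The two models agree at equality and diverge toward the tails.
for d in (0, 100, 200, 400):
    print(d, round(elo_expected(d), 3), round(linear_expected(d), 3))
```

Whether the straight line really fits engine results better than the curve is exactly the empirical question Sonas raises for human games.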
First, how many cores is Rybka 4 using when playing against Komodo? I hope the answer is 1 (potentially allowing you to play multiple simultaneous games).
Jeff Sonas' excellent article does not argue against the fact that closely spaced opponents provide more rating information than widely spaced opponents. This is shown in the first graph, where the deviation of two opponents at the same Elo is much smaller than the deviation at +/- 300 Elo. It's always going to be more difficult to make predictions based on the tail of the distribution. One way to get closer pairings would be to give non-R4 engines a time advantage when they play against Rybka. This would diminish your ability to determine how much better R4 is than the other engines, but would enhance your ability to discriminate between different flavors of R4.
Also note that Jeff's results are based on a list of human-human games with constrained rating differences between players (I think he mentions 100-120 Elo for the top players). I suspect he would have ended up with somewhat different results if he had instead relied on a database of engine-engine games (this would be an interesting experiment). One would expect engines to be more consistent than people for a couple of reasons:
- They don't have bad days and don't make random blunders, and
- Their strength doesn't really change over time as people's does.
For these reasons and others, I suspect that if Jeff generated a Sonas-style rating system for engine-engine games, it would have significant differences from the optimal human-human predictive rating system he developed.
All of my variants are running at 4 threads, as I have repeatedly stated. Komodo is rated 9th even though it is single-threaded. The rating is the rating.
You get more deviation at the ends because there are fewer games in the database with high ratings disparity. I also think that when a player does earn the chance to play a much higher-rated opponent, it is because they are playing better than their rating, or they are promising juniors... whose ratings may not be able to keep pace with their rate of improvement.
Time handicap is possible but not what I am after. I am trying to find where the engines would actually end up on a ratings chart. Handicapping engines makes any correction guesswork.
If anything, I suspect the result would be more linear. It would likely reach a point where it was just impossible for the weaker engine to win. What I would like to see is a graph where only decisive games were included, because I think it is easier to get two draws than a win at high ratings disparity. Something that, if true, should be figured into the ratings.
You get more deviation at the ends because there are fewer games in the database with high ratings disparity.
This is not a reasonable explanation. You have a scatter plot and for near equal ratings, where most of the points are falling, you see very few outliers, whereas when you have significantly different ratings, where there are a lot fewer games, you see many more outliers. This shows that the rating has better predictive results when the two players have similar ratings. In this case, this also works in reverse, i.e. the true strength of one of the entities is easier to ascertain if it is playing against an entity having nearly equal rating.
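This point can also be quantified: because the logistic Elo curve flattens at large rating gaps, the same amount of score noise translates into a larger Elo uncertainty. A rough delta-method sketch (win/loss games only, draws ignored for simplicity):

```python
import math

def elo_std_error(diff, games):
    """Delta-method estimate of the Elo standard error from `games`
    independent games at a true rating difference `diff` (no draws)."""
    p = 1 / (1 + 10 ** (-diff / 400))           # expected score
    se_score = math.sqrt(p * (1 - p) / games)    # binomial std error of score
    slope = math.log(10) / 400 * p * (1 - p)     # d(score)/d(diff)
    return se_score / slope                      # convert to Elo units

# Equal opponents vs. a 400-Elo gap, 100 games each: the gap case is noisier.
print(round(elo_std_error(0, 100)), round(elo_std_error(400, 100)))
```

The uncertainty at a 400-Elo gap comes out nearly twice that of an evenly matched pairing for the same number of games, which is the sense in which near-equal opponents carry more rating information per game.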
I am trying to find where the engines would actually end up on a ratings chart.
Once again, if you primarily want to know that the engines are X Elo better than Shredder or Komodo, then your method is appropriate. On the other hand, if you primarily want to know which parameter variations are stronger against the other engines, a time handicap will lead to faster convergence. With this approach, you would first find which parameter variation works best, and then test only that variation against the other engines without the time handicap.
What I would like to see is a graph where only decisive games were included, because I think it is easier to get two draws than a win at high ratings disparity. Something that if true should be figured into the ratings.
If you are playing with reversing colors, you can do this by throwing away all sets of openings where:
White won both games - under the assumption that white left book better,
Black won both games - under the assumption that black left book better, and
Both games were drawn - under the assumption that the book exit position was drawish.
This leaves two-game trials where one game was drawn and the other was decisive, and where one engine won with both colors. This might be a better method of figuring out whether one engine is better than another (it will have less bias), but it won't correlate directly with Elo.
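The filtering scheme above can be sketched as follows, assuming (hypothetically) that each opening's color-reversed pair is recorded as two scores from engine A's perspective:

```python
def filter_pairs(pairs):
    """Filter color-reversed two-game trials per the scheme above.
    Each pair is (result_as_white, result_as_black) from engine A's
    perspective, with results in {1.0, 0.5, 0.0}.  Discarded:
      white won both games  -> A scored (1.0, 0.0),
      black won both games  -> A scored (0.0, 1.0),
      both games were drawn -> A scored (0.5, 0.5)."""
    discarded = {(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)}
    return [p for p in pairs if p not in discarded]

pairs = [(1.0, 0.0), (0.5, 0.5), (1.0, 1.0), (1.0, 0.5), (0.0, 0.5)]
print(filter_pairs(pairs))  # keeps (1.0, 1.0), (1.0, 0.5), (0.0, 0.5)
```

What survives is exactly the draw-plus-decisive trials and the double wins, matching the three discard rules listed above.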
It is rather inflammatory to claim my argument is unreasonable. I highly doubt that was a normal scatter plot; the whole screen would be black if there were 266,000 games. How could you plot individual games anyway? They have only three outcomes, not percentages, unless those are 100%, 50%, and 0%, which would make for a dull graph. My guess is that each dot represents the average score of all games (where a win is worth 1 and draws are split 0.5-0.5) with the same Elo difference and color. Since there were fewer games at the higher disparities, those points will be more distorted by chance.
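The averaging I am guessing at could look something like this. It is a hypothetical reconstruction (the bucket width and sample data are mine, not the actual plot's method):

```python
from collections import defaultdict

def bucket_scores(games, bucket=25):
    """games: (elo_diff, score) pairs, with score in {1.0, 0.5, 0.0}
    from the higher-rated player's side. Returns a mapping from the
    rounded Elo-difference bucket to the mean score in that bucket."""
    sums = defaultdict(lambda: [0.0, 0])
    for diff, score in games:
        key = bucket * round(diff / bucket)   # snap to nearest bucket
        sums[key][0] += score
        sums[key][1] += 1
    return {k: s / n for k, (s, n) in sorted(sums.items())}

games = [(10, 0.5), (12, 1.0), (305, 1.0), (298, 1.0), (301, 0.5)]
print(bucket_scores(games))  # one averaged "dot" per Elo-difference bucket
```

With only a handful of games per high-disparity bucket, each averaged dot swings heavily on chance, which is exactly the distortion described above.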
There is no reason the engines have to be close in strength to find a rating, hence no reason to deprive us of both the relative Elo among variants and the relative Elo against other engines.
More games are generally better for a calculation...all sorts of extraneous things average out to nothing. I was unclear; I was talking about something else: the arbitrary equality of two draws to one win.
If, for example, there were a 20-game match between players A and B, where player A is 300 Elo stronger than player B, and B scored 3 wins, I think that is a stronger performance than if B had scored 6 draws instead. Current rating formulas, however, automatically gauge these performances as equal and would award the same rating adjustment in both cases.
Sonas is saying the statistical results of many thousands of games should guide the formulas, and I agree with that. We should also look at the rate of drawing and the rate of winning separately, since the chance of a draw may not be double the chance of a win, especially at the extremes. Making that error awards more points to the lower-rated player than is appropriate. Of course, without the data this is just an intuition, but it seems hardly likely that two draws exactly equal one win. It could even go in the other direction; the chance that it lines up exactly is rather small.
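The equivalence being objected to is easy to demonstrate with the standard logistic Elo expectancy (a generic formula, not anything specific to this rating list; the 20-game numbers are the hypothetical match from above):

```python
import math

def elo_expected(diff):
    """Standard logistic expectancy: the score the weaker player is
    expected to achieve when trailing by `diff` Elo points."""
    return 1.0 / (1.0 + 10.0 ** (diff / 400.0))

# Two 20-game results for player B, 300 Elo below player A:
score_wins  = (3 * 1.0 + 17 * 0.0) / 20   # 3 wins, 17 losses
score_draws = (6 * 0.5 + 14 * 0.0) / 20   # 6 draws, 14 losses
print(score_wins == score_draws)           # True: both are 0.15
print(round(elo_expected(300), 3))         # ~0.151 expected score
```

Because the rating formula sees only the total score, both 20-game results land on the same 0.15, close to the ~0.151 the model expects at a 300-point deficit, and produce identical rating adjustments.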
Komodo final and effect on interim chart.
(Columns: rank, engine, rating, games played.)
1 Rybka 4 x64 Exp 31 v1 3354 1153
2 Rybka 4 x64 Exp 30 v1 3349 1153
3 Rybka 4 x64 Exp 37 v1 3349 1151
4 Rybka 4 x64 Exp 38 v1 3349 1150
5 Rybka 4 x64 Exp 34 v1 3345 1153
6 Rybka 4 x64 Exp 36 v1 3343 1151
7 Rybka 4 x64 Exp 24 v2 3341 1331
8 Rybka 4 x64 Exp 33 v1 3339 1153
9 Rybka 4 x64 Exp 26 v2 3338 1329
10 Rybka 4 x64 Exp 32 v1 3338 1153
11 Rybka 4 x64 Exp 35 v1 3336 1152
12 Rybka 4 x64 Exp 28 v1 3335 1327
13 Rybka 4 x64 Exp 25 v2 3334 1330
14 Rybka 4 x64 Exp 11 v1 3333 454
15 Rybka 4 x64 Exp 27 v1 3328 1328
16 Rybka 4 x64 Exp 18 v1 3326 3039
17 Rybka 4 x64 Exp 22 v1 3323 2231
18 Rybka 4 x64 Exp 29 v1 3322 1327
19 Rybka 4 x64 Exp 9 v1 3319 434
20 Rybka 4 x64 Exp 23 v1 3318 2178
21 Rybka 4 x64 Exp 8 v1 3314 1315
22 Rybka 4 x64 Exp 14 Human 3305 900
23 Rybka 4 x64 Exp 7 v1 3303 815
24 Rybka 4 x64 Forum v1 3302 1160
25 Rybka 4 x64 Exp 10 v1 3302 462
26 Rybka 4 x64 Exp 4 v1 3299 900
27 Deep Rybka 4 x64 v1 3299 161
28 Rybka 4 x64 Exp 20 v1 3299 590
29 Rybka 4 x64 Exp 15 v1 3297 1800
30 Rybka 4 x64 Exp 1 v1 3298 916
31 Rybka 4 x64 Exp 3 v2 3297 423
32 Deep Rybka 4 x64 3294 1986
33 Rybka 4 x64 Exp 21 v1 3295 900
34 Rybka 4 x64 Exp 16 v1 3295 557
35 Rybka 4 x64 Exp 19 v1 3290 189
36 Deep Rybka 4 x64 Lasker 3286 479
37 Rybka 4 x64 Exp 13 vC13510 3284 750
38 Rybka 3 Dynamic 3284 892
39 Rybka 4 x64 Exp 2 v1 3284 472
40 Rybka 4 x64 Beta 15 v1 3284 426
41 Rybka 3 3283 1897
42 Rybka 4 x64 Exp 12 v1 3283 900
43 Rybka 3 Human 3281 1461
44 Rybka 4 x64 Exp 17 v1 3271 355
45 Stockfish 1.8 JA 64bit 3258 3131
46 Rybka 4 x64 Exp 6 v1 3259 121
47 Deep Rybka 4 x64 Human 3257 174
48 Stockfish 1.7.1 JA 64bit 4t 3237 4235
49 Stockfish 1.7 JA 64bit 4t 3224 560
50 Stockfish 1.6.3 JA 64bit 3207 384
51 Critter 0.80 64-bit 3186 1800
52 Deep Fritz 11 3123 4774
53 HIARCS 12 MP 3119 4183
54 Deep Shredder 12 3115 1824
55 Naum 4 4t 3108 4311
56 spark-0.4 3107 1802
57 Zappa Mexico II 3079 2519
58 Critter 0.70 64-bit 3078 2326
59 Deep Shredder 11 UCI 3064 2410
60 Komodo64 1.1 JA 3057 3238
61 Komodo64 1.0 JA 3054 407
62 bright-0.5c 3035 2350
63 Protector 1.3.3 x64 4t 3035 782
64 bright-0.4a 3020 21
65 Protector 1.3.6 x64 4t 2977 900
66 Thinker54AInert-MP64-UCI 2967 900
67 spark-0.3a 2935 23
68 Deep Fritz 12 2659 385

I suspect that if you ran this gauntlet again (i.e. another 900 games), you would find that the ordering of the R4 engine variants has no statistical significance and that the only thing you can ascertain from the games is that R4 with 4 threads is much better than Komodo on one thread.
I believe I said something like that. In itself each mini-match of only 100 games is not very meaningful, but together with the other ten opponents and their hundred games against each engine, you do get something more reliable. The information only shows itself when many games are collected, like puzzle pieces: each piece by itself makes little sense, but together they reveal a picture.
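As a rough sanity check on the significance question, here is a back-of-envelope estimate of the 95% Elo error bar for an n-game sample. The 50% score and 50% draw rate are assumptions of mine, not figures from the chart:

```python
import math

def elo_error_margin(n_games, draw_ratio=0.5, z=1.96):
    """Approximate 95% half-width, in Elo, of a rating estimated from
    n_games, assuming a 50% overall score and the given draw ratio."""
    # At a 50% score, wins and losses each occur (1 - draw_ratio) / 2.
    win = loss = (1.0 - draw_ratio) / 2.0
    mean = 0.5
    var = (win * (1.0 - mean) ** 2
           + draw_ratio * (0.5 - mean) ** 2
           + loss * (0.0 - mean) ** 2)
    se = math.sqrt(var / n_games)               # std. error of the score
    # Near a 50% score, one unit of score is about 1600/ln(10) ~ 695 Elo.
    return z * se * 1600.0 / math.log(10.0)

print(round(elo_error_margin(1150)))  # -> 14 (roughly +/-14 Elo)
print(round(elo_error_margin(100)))   # -> 48 (a single 100-game match)
```

On these assumptions, a single 100-game mini-match carries an error bar near 50 Elo, while pooling ~1150 games shrinks it to around 14, which is consistent with both points above: the individual matches mean little, but the variants near the top of the chart are also only separated by less than the pooled error bar.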
Hello mindbreaker
after all of your pioneering work in settings tests, I wonder why you don't post the parameters, so we could try to reproduce, or even benefit from, your work.
I am also surprised that nobody seems to be curious what the settings are.
Would you share some of your best settings with us?
Kind regards, Clemens
http://rybkaforum.net/cgi-bin/rybkaforum/topic_show.pl?pid=277108#pid277108
"And at the end at 1200 games per engine or less if I end it sooner, I plan to post the parameters and final results."
"And at the end at 1200 games per engine or less if I end it sooner, I plan to post the parameters and final results."
Exp 24 has been posted; it is pretty decent according to my tests: http://rybkaforum.net/cgi-bin/rybkaforum/topic_show.pl?pid=275286;hl=exp
And I posted some earlier ones too.
The whole collection will be posted soon.
As I have reached 1150 games and will likely do more than the 1200, I thought I should go ahead and post all the parameters even though it is not quite complete. So here it is...attached.
Attachment: MindbreakerR4Experiments.xls (29k)
Hello mindbreaker
thank you for your parameter file. I will try some of them and report here about.
Have a nice day
Clemens