Parameters Experiment 38: 64 Elo over R4 default

My latest batch of experiments is a little over 400 games in per engine. One is showing very good results: it has won 3 of 4 gauntlets and came in a close second in the fourth.

They only play against non-R4s, just as ratings are determined in most ratings charts.

I am getting 3357 Elo based on 406 games, whereas Rybka 4 is at 3293 Elo based on 1986 games.

It is of course very preliminary; I expect to run the engines 900-1200 games each.

So far I have not seen any indication that I am running out of room to improve the parameters.

The current batch of 9 has completed 100-game matches with Stockfish 1.8, Deep Fritz 11, Critter 0.80, and an unnameable opponent. The current opponent is Zappa Mexico II, in a gauntlet which has just gotten under way.

I plan to give an update at 800 games in. And at the end at 1200 games per engine or less if I end it sooner, I plan to post the parameters and final results.

Looking forward to it, you could be finding new Rybka 4.1 defaults!

I don't think it will become the default, though it might be good as an included personality. I suspect R4 was tuned against other variants of itself and is probably optimized to get the best results against other variants of R4. If that is the case, my variant could easily lose to it head to head.

What it should be good at is tournaments where there will only be one Rybka 4, or ratings charts.

I may try for a match version as well that will be very good at not losing, but at the cost of more draws. But I suspect 4.1 will arrive before I get to it.

> But I suspect 4.1 will arrive before I get to it.

Or not, Vas is still nowhere to be found (eh, virtually-virtually, in the real-virtual world I could send him an email and he'd answer, but I don't want to bug him).

You don't want to bug him.

> But I suspect 4.1 will arrive before I get to it.

You believe everything?

I believe it only when I see it.

Yes, I have noticed this myself in (C) tournaments where you have multiple personalities who all score slightly better or worse than default then having 1 particular personality doing well against those variants but worse against default itself when head/head. very annoying... and back to drawing board :)

probably i'm wrong (because deep inside myself i simply can't believe you get 60+ elo over default) but i have the feeling you tested too many variants...

let's suppose you want to test all the engines in this world (obviously i'll exaggerate) by doing a ...let's say... 32-game match against rybka.

after testing thousands of engines i wouldn't be surprised to see somewhere in the huge table something like Booot 4.15 - rybka 4 19.0 - 13.0

how wrong am i? could you test more the best variant?

You need to go to the last page (Tournament of the Champions). Indeed, the best of the variants appear closer to 30 Elo stronger rather than 60. It however is ridiculous to compare my tests to 32-game matches with one opponent. Most of the variants play over 800 games total against 10 or more opponents. As for how the tests pointed at 60+ Elo gains...that has not been resolved but probably has something to do with the rating formula functions or Rybka 4 Default just having very bad luck initially.

Looking forward to the results & parameters.

Up to now I have not seen any real improvement by changing the parameters.

regards

> Up to now I have not seen any real improvement by changing the parameters.

Me neither.

You can also see here: http://www.computerchess.org.uk/ccrl/4040/rating_list_all.html that the claim that Rybka 4 64-bit TC3100150 4CPU was stronger than default was just an illusion.

Regards,

Gaмßito.

Well, I would not characterize it as an illusion exactly. It is just that it only outperforms at short time controls: http://www.computerchess.org.uk/ccrl/404/rating_list_all.html

I found those timings effective as well, but I have not tried to refine them, though they are included. All my efforts have been directed at piece valuation beyond the initial investigations of parameters given in the forums. I am hoping by doing so, that the improvements are general. Still it is conceivable that the parameters I am working on are only or most effective at faster time controls.

Ah, just remembered: I did investigate the rook endgame scaling thing. That got me nowhere. 100 is optimal; any deviation is destructive according to my findings.

"It is just that it only outperforms at short time controls:"

LOL, so at my 30 days/move controls, the defaults are better.

I guess you just need 1000 computers and a couple years to find out.

Sounds interesting. I will stay tuned

Wayne

As promised, here is the update after 800 games each. I plan at least 300 more per engine. The current competitors are Exp 30-Exp 38.

I suppose it is not that surprising that Exp 38 could not continue that performance; however, it did manage a narrow overall victory at this stage.

For a second I thought it had surrendered its first place; the Fritz software actually mis-sorted! That's a new one.

Each generation/cycle has seen a ratings increase over the previous generation. It looks like this time it will not be a big jump; 24 was the leader of the last generation, 18 before that, and 8 and Forum before that. The current gain over the default settings is 44 Elo. But as I have already stated, these settings are not configured for direct head-to-head matches with other R4 configurations, only for matches against other top engines in, say, the first 10-15 places on most ratings lists.

Here is the new list:

Rank / Engine / Rating (R) / Games (G)

**1 Rybka 4 x64 Exp 38 v1 3337 806**

2 Rybka 4 x64 Exp 31 v1 3336 810

3 Rybka 4 x64 Exp 24 v2 3332 1331

4 Rybka 4 x64 Exp 37 v1 3331 807

5 Rybka 4 x64 Exp 30 v1 3330 810

6 Rybka 4 x64 Exp 26 v2 3329 1329

7 Rybka 4 x64 Exp 34 v1 3329 810

8 Rybka 4 x64 Exp 28 v1 3325 1327

9 Rybka 4 x64 Exp 25 v2 3324 1330

10 Rybka 4 x64 Exp 18 v1 3321 3039

11 Rybka 4 x64 Exp 11 v1 3320 453

12 Rybka 4 x64 Exp 27 v1 3319 1328

13 Rybka 4 x64 Exp 32 v1 3318 810

14 Rybka 4 x64 Exp 36 v1 3318 807

15 Rybka 4 x64 Exp 33 v1 3318 810

16 Rybka 4 x64 Exp 22 v1 3317 2231

17 Rybka 4 x64 Exp 35 v1 3317 809

18 Rybka 4 x64 Exp 29 v1 3312 1327

19 Rybka 4 x64 Exp 23 v1 3311 2178

20 Rybka 4 x64 Exp 8 v1 3309 1310

21 Rybka 4 x64 Exp 9 v1 3306 429

22 Rybka 4 x64 Exp 14 Human 3304 898

23 Rybka 4 x64 Exp 7 v1 3301 815

24 Rybka 4 x64 Exp 4 v1 3298 898

25 Deep Rybka 4 x64 v1 3298 161

26 Rybka 4 x64 Forum v1 3296 1159

27 Rybka 4 x64 Exp 1 v1 3296 916

28 Rybka 4 x64 Exp 20 v1 3297 590

29 Rybka 4 x64 Exp 15 v1 3296 1799

30 Rybka 4 x64 Exp 3 v2 3296 422

31 Rybka 4 x64 Exp 16 v1 3294 555

**32 Deep Rybka 4 x64 3293 1986**

33 Rybka 4 x64 Exp 21 v1 3293 900

34 Rybka 4 x64 Exp 10 v1 3289 459

35 Rybka 4 x64 Exp 19 v1 3289 189

36 Deep Rybka 4 x64 Lasker 3285 477

37 Rybka 4 x64 Exp 13 vC13510 3283 749

38 Rybka 4 x64 Exp 2 v1 3282 471

39 Rybka 4 x64 Beta 15 v1 3282 426

40 Rybka 4 x64 Exp 12 v1 3282 896

41 Rybka 3 3279 1897

42 Rybka 3 Dynamic 3279 892

43 Rybka 3 Human 3277 1461

44 Rybka 4 x64 Exp 17 v1 3270 354

45 Rybka 4 x64 Exp 6 v1 3257 121

46 Deep Rybka 4 x64 Human 3255 174

47 Stockfish 1.8 JA 64bit 3245 3118

48 Stockfish 1.7.1 JA 64bit 4t 3233 4225

49 Stockfish 1.7 JA 64bit 4t 3222 560

50 Stockfish 1.6.3 JA 64bit 3205 384

51 Critter 0.80 64-bit 3172 1800

52 Deep Fritz 11 3115 4772

53 Naum 4 4t 3112 3414

54 HIARCS 12 MP 3105 3283

55 Deep Shredder 12 3101 1824

56 spark-0.4 3093 1802

57 Critter 0.70 64-bit 3075 2324

58 Komodo64 1.1 JA 3075 2333

59 Zappa Mexico II 3068 2519

60 Protector 1.3.3 x64 4t 3067 385

61 Deep Shredder 11 UCI 3061 2410

62 Komodo64 1.0 JA 3051 407

63 bright-0.5c 3032 2350

64 bright-0.4a 3014 21

65 Protector 1.3.6 x64 4t 2968 900

66 Thinker54AInert-MP64-UCI 2959 900

67 spark-0.3a 2933 23

68 Deep Fritz 12 2651 385

Sorry about the alignment; it just does that when I copy and paste from Excel.

Thanks for the report, what are the settings of Exp 38?

"And at the end at 1200 games per engine or less if I end it sooner, I plan to post the parameters and final results." ;)

It would be interesting to carefully look at the statistics to see if any of these are really better than the default (in a statistically significant sense), or whether as Dagh suggested, this is just a matter of having a large number of similar strength settings, with some unsurprisingly faring better than others due to statistical noise.
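A rough way to put numbers on "statistically significant" is to estimate the error bar on a rating from the number of games played. The sketch below is only an approximation under stated assumptions, not what the rating software in this thread does: games are treated as independent, the `draw_rate` value is a hypothetical placeholder, and the standard error of the score is converted to Elo through the slope of the logistic Elo curve.

```python
import math

def elo_from_score(score):
    """Elo difference implied by an expected score under the logistic Elo model."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_margin(games, score=0.5, draw_rate=0.4, z=1.96):
    """Approximate 95% half-width (in Elo) of a rating estimated from `games` games.

    draw_rate is a hypothetical assumption; more draws mean less per-game variance.
    """
    win = score - draw_rate / 2.0
    if win < 0 or win + draw_rate > 1:
        raise ValueError("score and draw_rate are inconsistent")
    # Variance of a single game result (1, 0.5 or 0) around its mean.
    var = win * 1.0 + draw_rate * 0.25 - score ** 2
    se_score = math.sqrt(var / games)
    # Slope of Elo with respect to score (delta method).
    slope = 400.0 / (math.log(10) * score * (1.0 - score))
    return z * se_score * slope

# Game counts quoted in the thread: 406, 806 and 1986.
for n in (406, 806, 1986):
    print(n, round(elo_margin(n), 1))
```

Under these assumptions the margin at about 400 games is on the order of ±25 Elo, and it shrinks only with the square root of the number of games, so halving it takes four times as many games.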

There's still the question of why new experimental parameters are consistently doing better than old ones (i.e. he always finds a new one that performs better than all the old ones).

First, I'm not sure if he is changing time allocation values. As a person who is only interested in analysis strength, I have zero interest in this aspect of tuning. Second, it's very easy, and natural, to come up with great results using a mixture of a large number of candidates and survival bias (just get rid of the laggards after 1000 games or so and you will end up with miraculous results, even if all the engines are identical in every way).
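The survival-bias point can be illustrated with a small simulation. This is a toy sketch, not a model of the actual test setup: every engine is given exactly the same true strength (expected score 0.5), the draw rate is a hypothetical assumption, and `n_engines=38` and `games=800` are simply picked to resemble the numbers mentioned in this thread.

```python
import math
import random

def simulate_best_of_identical(n_engines=38, games=800, draw_rate=0.4, seed=1):
    """Play `games` games for each of `n_engines` identical engines (true score 0.5)
    and return the apparent Elo advantage of the luckiest one."""
    rng = random.Random(seed)
    win_p = (1.0 - draw_rate) / 2.0  # win and loss are equally likely
    best = 0.0
    for _ in range(n_engines):
        points = 0.0
        for _ in range(games):
            r = rng.random()
            if r < win_p:
                points += 1.0
            elif r < win_p + draw_rate:
                points += 0.5
        best = max(best, points / games)
    # Convert the best observed score into an Elo difference.
    return -400.0 * math.log10(1.0 / best - 1.0)

print(round(simulate_best_of_identical(), 1))
```

The top finisher will usually show a positive "gain" even though no engine is actually stronger, which is why the leader of a large field needs to be retested on fresh games before its margin is trusted.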

Anyway, as I stated below, I'm very willing to be proven wrong (but also rather skeptical).

If it were statistical noise, then each cycle would not perform better than the one before as a group. If you will notice, the latest 9 engines finished no lower than 17th place, and even the lowest one is 24 Elo higher than the default settings. If you took the group to be the same engine, then that engine would have played 7269 games and earned an Elo of 3326: hardly statistical noise, and an improvement of 33 Elo.

Actually, what I should have said is statistical noise combined with survivor bias. I should further point out that my comments apply only to non-time allocation changes. Everyone agrees that the time allocation for engine games in R4 is God awful, but this has zero relevance to the large majority of people using the engine for analysis purposes rather than engine-engine games.

I'll admit that I'm skeptical that you have found changes in piece values that measurably improve strength, but I would be happy to be proven wrong. Perhaps after you release the settings, the piece value settings can be tested against a gamut of other engines by a third-party tester and the results can be compared to the default piece value settings.

I predict that mindbreaker's improvements will hold against non-Rybka engines.

I hope that there are improvements due to non-time-management parameter changes. But we shall see...

In the first generation I tried a few timings that were given on the site but pretty much since then (second to fifth generation each with roughly 9 engines per generation) I have not messed with the timings. The last 38 engines have been 3 100 150 except a handful of engines which were default timings instead. As timings have been essentially constant, the piece values should be what is stratifying their performances.

Then default settings with only 3 100 150 changed should appear as reference (so timings can be ignored).

It is just statistical noise but check out what Exp 30 is doing to Komodo ;)

Here is the current ratings table (without the vs Komodo round as it is not complete). I will probably run one more cannon fodder engine after that. As you can see, we have a new leader!

1 Rybka 4 x64 Exp 31 v1 3346 1053

2 Rybka 4 x64 Exp 38 v1 3341 1050

3 Rybka 4 x64 Exp 30 v1 3339 1053

4 Rybka 4 x64 Exp 37 v1 3338 1051

5 Rybka 4 x64 Exp 24 v2 3335 1331

6 Rybka 4 x64 Exp 34 v1 3333 1053

7 Rybka 4 x64 Exp 36 v1 3333 1051

8 Rybka 4 x64 Exp 26 v2 3332 1329

9 Rybka 4 x64 Exp 32 v1 3330 1053

10 Rybka 4 x64 Exp 28 v1 3329 1327

11 Rybka 4 x64 Exp 25 v2 3328 1330

12 Rybka 4 x64 Exp 33 v1 3327 1053

13 Rybka 4 x64 Exp 11 v1 3326 454

14 Rybka 4 x64 Exp 35 v1 3324 1052

15 Rybka 4 x64 Exp 18 v1 3323 3039

16 Rybka 4 x64 Exp 27 v1 3322 1328

17 Rybka 4 x64 Exp 22 v1 3319 2231

18 Rybka 4 x64 Exp 29 v1 3316 1327

19 Rybka 4 x64 Exp 23 v1 3314 2178

20 Rybka 4 x64 Exp 8 v1 3311 1315

21 Rybka 4 x64 Exp 9 v1 3312 434

22 Rybka 4 x64 Exp 14 Human 3305 900

23 Rybka 4 x64 Exp 7 v1 3303 815

24 Rybka 4 x64 Forum v1 3299 1160

25 Rybka 4 x64 Exp 4 v1 3299 900

26 Rybka 4 x64 Exp 20 v1 3298 590

27 Deep Rybka 4 x64 v1 3299 161

28 Rybka 4 x64 Exp 1 v1 3297 916

29 Rybka 4 x64 Exp 15 v1 3297 1800

30 Rybka 4 x64 Exp 3 v2 3297 423

31 Rybka 4 x64 Exp 21 v1 3294 900

32 Rybka 4 x64 Exp 16 v1 3294 557

33 Rybka 4 x64 Exp 10 v1 3295 462

34 Deep Rybka 4 x64 3294 1986

35 Rybka 4 x64 Exp 19 v1 3290 189

36 Deep Rybka 4 x64 Lasker 3286 479

37 Rybka 4 x64 Exp 13 vC13510 3284 750

38 Rybka 4 x64 Exp 2 v1 3283 472

39 Rybka 4 x64 Beta 15 v1 3283 426

40 Rybka 4 x64 Exp 12 v1 3282 900

41 Rybka 3 3279 1897

42 Rybka 3 Dynamic 3279 892

43 Rybka 3 Human 3276 1461

44 Rybka 4 x64 Exp 17 v1 3270 355

45 Rybka 4 x64 Exp 6 v1 3258 121

46 Deep Rybka 4 x64 Human 3256 174

47 Stockfish 1.8 JA 64bit 3251 3131

48 Stockfish 1.7.1 JA 64bit 4t 3234 4235

49 ****** 3221 24

50 Stockfish 1.7 JA 64bit 4t 3220 560

51 Stockfish 1.6.3 JA 64bit 3202 384

52 Critter 0.80 64-bit 3178 1800

53 Deep Fritz 11 3118 4774

54 HIARCS 12 MP 3115 4183

55 Deep Shredder 12 3107 1824

56 Naum 4 4t 3103 4311

57 spark-0.4 3099 1802

58 Komodo64 1.1 JA 3077 2338

59 Critter 0.70 64-bit 3077 2326

60 Zappa Mexico II 3072 2519

61 Deep Shredder 11 UCI 3062 2410

62 Komodo64 1.0 JA 3049 407

63 bright-0.5c 3033 2350

64 Protector 1.3.3 x64 4t 3027 782

65 bright-0.4a 3016 21

66 Protector 1.3.6 x64 4t 2972 900

67 Thinker54AInert-MP64-UCI 2962 900

68 spark-0.3a 2931 23

69 Deep Fritz 12 2654 385

Oh, ignore the v1/v2 stuff; it is meaningless.

Sorry, I guess one clone slipped through.

That is to say?

So after the Komodo test, you will again have a new leader, Exp 30, because at the moment Exp 38 & 31 are getting the lowest results, right?

JP.

Even with about 85 rounds complete we really don't know which is best against Komodo. We would need maybe 3000 rounds or more for that. I am just getting games from several opponents which when collected together tells me the top handful of engines.

For Elo testing purposes, engines that are nearly equal give the most valuable rating information. Spending a large amount of time on matches against Komodo, where Rybka's win percentage is ~90%, is not time well spent.
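One way to quantify this claim is to compare how much information a single game carries about the rating difference at different expected scores. The sketch below assumes a plain win/loss logistic Elo model and ignores draws, so it is a simplification rather than what any particular rating tool computes; the function names are illustrative.

```python
import math

def expected_score(elo_diff):
    """Expected score of the stronger side under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def relative_information(elo_diff):
    """Fisher information one game carries about the Elo difference,
    relative to a game between equal opponents (win/loss model, draws ignored)."""
    p = expected_score(elo_diff)
    return p * (1.0 - p) / 0.25

# 382 Elo corresponds to an expected score of roughly 90%.
for diff in (0, 100, 200, 382):
    print(diff, round(expected_score(diff), 2), round(relative_information(diff), 2))
```

In this simplified model a game at a 90% expected score carries a bit over a third of the information of an even game, so lopsided pairings still contribute to a rating, but it takes nearly three times as many games to reach the same precision.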

I am running out of opponents. Komodo is the 9th best engine: http://www.computerchess.org.uk/ccrl/404.live/

There is one engine not on the table because it never won or drew a game so rating could not be calculated. I ended that after 126 games. It is rated 2950 at CCRL. If you make them strong, they have this tendency of winning ;)

I am trying to use the strongest opponents I can find. I have run all the stronger opponents I have except Rybka 3, Rybka 3 Dynamic, and Rybka 3 Human. Maybe I will run those, but I am not a fan of running different versions of the same program against one another.

And who is to say that strength is only determined by close pairings? Should not ratings be legitimate even with some distance between opponents provided there are some draws and losses by the stronger side? According to critiques of the current ratings formulas it is actually the stronger side's rating that is underestimated by Elo tables.

A straight line is better than the Elo curve: http://www.chessbase.com/newsdetail.asp?newsid=562

First, how many cores is Rybka 4 using when playing against Komodo? I hope the answer is 1 (potentially allowing you to play multiple simultaneous games).

Jeff Sonas' excellent article does not argue against the fact that closely spaced opponents provide more rating information than widely spaced opponents. This is shown in the first graph, where the deviation of two opponents at the same Elo is much smaller than the deviation at +/- 300 Elo. It's always going to be more difficult to make predictions based on the tail of the distribution. One way to bring the opponents closer in strength would be to give non-R4 engines a time advantage when they play against Rybka. This would diminish your ability to determine how much better R4 is than the other engines, but would enhance your ability to discriminate between different flavors of R4.

Also note that Jeff's results are based on a list of human-human games with constrained rating differences between players (I think he mentions 100-120 Elo for the top players). I suspect he would have ended up with somewhat different results if he had instead relied on a database of engine-engine games (this would be an interesting experiment). One would expect engines to be more consistent than people for a couple of reasons:

- They don't have bad days and don't make random blunders, and

- Their strength doesn't really change over time as people's does.

For these reasons and others, I suspect that if Jeff generated a Sonas E rating system for engine-engine games, it would have significant differences from the optimal human-human predictive rating system he developed.

All of my variants are running at 4-threads as I have repeatedly stated. Komodo is rated 9th even though it is one thread. The rating is the rating.

You get more deviation at the ends because there are fewer games in the database with high ratings disparity. I also think that when a player does earn the chance to play a much higher opponent it is because they are playing better than their rating or they are promising juniors...whose ratings may not be able to keep pace with their rate of improvement.

Time handicap is possible but not what I am after. I am trying to find where the engines would actually end up on a ratings chart. Handicapping engines makes any correction guesswork.

If anything, I suspect the result would be more linear. It would likely reach a point where it was just impossible for the weaker engine to win. What I would like to see is a graph where only decisive games were included, because I think it is easier to get two draws than a win at high ratings disparity. Something that if true should be figured into the ratings.

*You get more deviation at the ends because there are fewer games in the database with high ratings disparity.*

This is not a reasonable explanation. You have a scatter plot and for near equal ratings, where most of the points are falling, you see very few outliers, whereas when you have significantly different ratings, where there are a lot fewer games, you see many more outliers. This shows that the rating has better predictive results when the two players have similar ratings. In this case, this also works in reverse, i.e. the true strength of one of the entities is easier to ascertain if it is playing against an entity having nearly equal rating.

*I am trying to find where the engines would actually end up on a ratings chart.*

Once again, if you primarily want to know that the engines are X Elo better than Shredder or Komodo, then your method is appropriate. On the other hand, if you primarily want to know which parameter variations are stronger against the other engines, a time handicap will lead to faster convergence. With this approach, you would first find which parameter variation works best, and then test only that variation against the other engines without the time handicap.

*What I would like to see is a graph where only decisive games were included, because I think it is easier to get two draws than a win at high ratings disparity. Something that if true should be figured into the ratings.*

If you are playing with reversing colors, you can do this by throwing away all sets of openings where:

White won both games - under the assumption that white left book better,

Black won both games - under the assumption that black left book better, and

Both games were drawn - under the assumption that the book exit position was drawish.

This leaves two-game trials where one game was drawn and the other was decisive, or where one engine won with both colors. This might be a better method of figuring out if one engine is better than another (it will have less bias), but it won't correlate directly with Elo.
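The filtering rule described above can be sketched in a few lines. The data layout is an assumption made for illustration: each opening is represented as a pair of results for a hypothetical engine A, first as White and then as Black, scored 1.0, 0.5 or 0.0 from A's point of view.

```python
from collections import Counter

def informative_pairs(pairs):
    """Drop opening pairs where White won both games, Black won both games,
    or both games were drawn; keep everything else.

    Each pair is (a_as_white, a_as_black), scored from engine A's point of view.
    """
    kept = []
    for as_white, as_black in pairs:
        white_won_both = as_white == 1.0 and as_black == 0.0
        black_won_both = as_white == 0.0 and as_black == 1.0
        both_drawn = as_white == 0.5 and as_black == 0.5
        if not (white_won_both or black_won_both or both_drawn):
            kept.append((as_white, as_black))
    return kept

def tally(pairs):
    """Count how many kept pairs each engine won on total points."""
    counts = Counter()
    for as_white, as_black in pairs:
        counts["A" if as_white + as_black > 1.0 else "B"] += 1
    return counts

pairs = [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0), (1.0, 0.5), (0.0, 0.5), (1.0, 1.0)]
kept = informative_pairs(pairs)
print(kept, dict(tally(kept)))
```

Every discarded pair ends 1-1 on points, so each pair that survives the filter has a clear winner and the tally reduces to a simple count of pair wins for each side.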

It is rather inflammatory to claim my argument is unreasonable. I highly doubt that was a normal scatter plot. The whole screen would be black if there were 266,000 games. How could you plot games anyway? They only have three outcomes, not percentages, unless they are 100%, 50%, and 0%, which would make for a dull graph. My guess is that each dot represents the average % of all games (where a win is worth 1 and draws are split .5-.5) with the same Elo difference and color. As there were fewer games with the higher disparities, they will be more distorted by chance.

There is no reason the engines have to be close in strength to find a rating, hence no reason to deprive us of both relative Elo among variants and relative Elo to other engines.

More games is generally better for a calculation...all sorts of extraneous things average out to nothing. I was unclear. I was talking about something else: the arbitrary equality of two draws to a win.

If, for example, there was a 20-game match between players A and B, where player A is 300 Elo stronger than player B, and B got 3 wins, I think that is a stronger result than if B got 6 draws instead. But current rating formulas automatically gauge these performances as the same and would in both instances award the same rating adjustment.
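To make the 20-game example concrete, here is a small sketch using the standard logistic Elo expectation formula. The K-factor of 10 is a purely hypothetical choice for illustration; rating lists for engines usually fit all games at once rather than applying a K-factor update.

```python
def expected_score(rating_diff: float) -> float:
    """Expected score per game for a player whose rating minus the
    opponent's rating is rating_diff (standard logistic Elo formula)."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

def match_score(wins: int, draws: int) -> float:
    """Points scored: a win counts 1, a draw counts 0.5."""
    return wins + 0.5 * draws

def rating_change(wins: int, draws: int, games: int,
                  rating_diff: float, k: float = 10.0) -> float:
    """Classic update K * (actual - expected). K = 10 is an assumption."""
    return k * (match_score(wins, draws) - games * expected_score(rating_diff))

# Player B is 300 Elo weaker than A and plays 20 games.
change_three_wins = rating_change(wins=3, draws=0, games=20, rating_diff=-300)
change_six_draws = rating_change(wins=0, draws=6, games=20, rating_diff=-300)
```

Both results are worth 3 points out of 20, so the formula cannot tell them apart: `change_three_wins` and `change_six_draws` come out identical. B's expected score here is roughly 0.15 per game, or about 3 points over the match, so either way the adjustment is close to zero.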

Sonas is saying the statistical results of many thousands of games should be the guide to the formulas, and I agree with that. We should also look into the rate of drawing and the rate of winning separately, as the chance of a draw may not be double the chance of a win, especially at the extremes. If it is not, assuming that it is awards more points to the lower-rated player than is appropriate. Of course, without the data, this is just an intuition. But it seems hardly likely that two draws exactly equal a win. It could even be off in the other direction, but the chance that it just lines up is rather small.
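If someone had the game data, the check could be as simple as tallying wins and draws separately for each band of rating difference. The sketch below assumes a made-up input format, a list of `(elo_diff, result)` tuples seen from one player's side, and a bucket width of 50 Elo; neither comes from Sonas or from any real database.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def rates_by_bucket(games: Iterable[Tuple[int, str]],
                    bucket_width: int = 50) -> Dict[int, Tuple[int, float, float]]:
    """Group games by rating difference and report win and draw rates.

    games: (elo_diff, result) with elo_diff = player rating minus opponent
           rating and result one of "win", "draw", "loss" for that player.
    Returns {bucket_start: (game_count, win_rate, draw_rate)}.
    """
    counts = defaultdict(lambda: {"win": 0, "draw": 0, "loss": 0})
    for elo_diff, result in games:
        # Floor to the start of the bucket, e.g. -310 -> -350 for width 50.
        bucket = (elo_diff // bucket_width) * bucket_width
        counts[bucket][result] += 1

    rates = {}
    for bucket, c in counts.items():
        n = c["win"] + c["draw"] + c["loss"]
        rates[bucket] = (n, c["win"] / n, c["draw"] / n)
    return rates

# Tiny hypothetical sample, only to show the shape of the output.
sample = [(-300, "win"), (-310, "draw"), (-320, "loss"), (-340, "draw")]
rates = rates_by_bucket(sample)
```

With enough games per bucket, comparing the draw rate to twice the win rate at large negative differences would show whether the two-draws-equal-one-win assumption holds there.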

Komodo final and effect on interim chart.

| Rank | Engine | Elo | Games |
|---|---|---|---|
| 1 | Rybka 4 x64 Exp 31 v1 | 3354 | 1153 |
| 2 | Rybka 4 x64 Exp 30 v1 | 3349 | 1153 |
| 3 | Rybka 4 x64 Exp 37 v1 | 3349 | 1151 |
| 4 | Rybka 4 x64 Exp 38 v1 | 3349 | 1150 |
| 5 | Rybka 4 x64 Exp 34 v1 | 3345 | 1153 |
| 6 | Rybka 4 x64 Exp 36 v1 | 3343 | 1151 |
| 7 | Rybka 4 x64 Exp 24 v2 | 3341 | 1331 |
| 8 | Rybka 4 x64 Exp 33 v1 | 3339 | 1153 |
| 9 | Rybka 4 x64 Exp 26 v2 | 3338 | 1329 |
| 10 | Rybka 4 x64 Exp 32 v1 | 3338 | 1153 |
| 11 | Rybka 4 x64 Exp 35 v1 | 3336 | 1152 |
| 12 | Rybka 4 x64 Exp 28 v1 | 3335 | 1327 |
| 13 | Rybka 4 x64 Exp 25 v2 | 3334 | 1330 |
| 14 | Rybka 4 x64 Exp 11 v1 | 3333 | 454 |
| 15 | Rybka 4 x64 Exp 27 v1 | 3328 | 1328 |
| 16 | Rybka 4 x64 Exp 18 v1 | 3326 | 3039 |
| 17 | Rybka 4 x64 Exp 22 v1 | 3323 | 2231 |
| 18 | Rybka 4 x64 Exp 29 v1 | 3322 | 1327 |
| 19 | Rybka 4 x64 Exp 9 v1 | 3319 | 434 |
| 20 | Rybka 4 x64 Exp 23 v1 | 3318 | 2178 |
| 21 | Rybka 4 x64 Exp 8 v1 | 3314 | 1315 |
| 22 | Rybka 4 x64 Exp 14 Human | 3305 | 900 |
| 23 | Rybka 4 x64 Exp 7 v1 | 3303 | 815 |
| 24 | Rybka 4 x64 Forum v1 | 3302 | 1160 |
| 25 | Rybka 4 x64 Exp 10 v1 | 3302 | 462 |
| 26 | Rybka 4 x64 Exp 4 v1 | 3299 | 900 |
| 27 | Deep Rybka 4 x64 v1 | 3299 | 161 |
| 28 | Rybka 4 x64 Exp 20 v1 | 3299 | 590 |
| 29 | Rybka 4 x64 Exp 15 v1 | 3297 | 1800 |
| 30 | Rybka 4 x64 Exp 1 v1 | 3298 | 916 |
| 31 | Rybka 4 x64 Exp 3 v2 | 3297 | 423 |
| 32 | Deep Rybka 4 x64 | 3294 | 1986 |
| 33 | Rybka 4 x64 Exp 21 v1 | 3295 | 900 |
| 34 | Rybka 4 x64 Exp 16 v1 | 3295 | 557 |
| 35 | Rybka 4 x64 Exp 19 v1 | 3290 | 189 |
| 36 | Deep Rybka 4 x64 Lasker | 3286 | 479 |
| 37 | Rybka 4 x64 Exp 13 vC13510 | 3284 | 750 |
| 38 | Rybka 3 Dynamic | 3284 | 892 |
| 39 | Rybka 4 x64 Exp 2 v1 | 3284 | 472 |
| 40 | Rybka 4 x64 Beta 15 v1 | 3284 | 426 |
| 41 | Rybka 3 | 3283 | 1897 |
| 42 | Rybka 4 x64 Exp 12 v1 | 3283 | 900 |
| 43 | Rybka 3 Human | 3281 | 1461 |
| 44 | Rybka 4 x64 Exp 17 v1 | 3271 | 355 |
| 45 | Stockfish 1.8 JA 64bit | 3258 | 3131 |
| 46 | Rybka 4 x64 Exp 6 v1 | 3259 | 121 |
| 47 | Deep Rybka 4 x64 Human | 3257 | 174 |
| 48 | Stockfish 1.7.1 JA 64bit 4t | 3237 | 4235 |
| 49 | Stockfish 1.7 JA 64bit 4t | 3224 | 560 |
| 50 | Stockfish 1.6.3 JA 64bit | 3207 | 384 |
| 51 | Critter 0.80 64-bit | 3186 | 1800 |
| 52 | Deep Fritz 11 | 3123 | 4774 |
| 53 | HIARCS 12 MP | 3119 | 4183 |
| 54 | Deep Shredder 12 | 3115 | 1824 |
| 55 | Naum 4 4t | 3108 | 4311 |
| 56 | spark-0.4 | 3107 | 1802 |
| 57 | Zappa Mexico II | 3079 | 2519 |
| 58 | Critter 0.70 64-bit | 3078 | 2326 |
| 59 | Deep Shredder 11 UCI | 3064 | 2410 |
| 60 | Komodo64 1.1 JA | 3057 | 3238 |
| 61 | Komodo64 1.0 JA | 3054 | 407 |
| 62 | bright-0.5c | 3035 | 2350 |
| 63 | Protector 1.3.3 x64 4t | 3035 | 782 |
| 64 | bright-0.4a | 3020 | 21 |
| 65 | Protector 1.3.6 x64 4t | 2977 | 900 |
| 66 | Thinker54AInert-MP64-UCI | 2967 | 900 |
| 67 | spark-0.3a | 2935 | 23 |
| 68 | Deep Fritz 12 | 2659 | 385 |

I suspect that if you ran this gauntlet again (i.e. another 900 games), you would find that the ordering of the R4 engine variants has no statistical significance and that the only thing you can ascertain from the games is that R4 with 4 threads is much better than Komodo on one thread.

I believe I said something like that. In itself it is not very meaningful, as each mini-match is only 100 games, but taken together with the other ten opponents and their one hundred games versus each engine, you do get something more reliable. There is information, but it only shows itself when many games are collected, like puzzle pieces. Each piece by itself makes little sense, but together they reveal a picture.
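For a rough feel of how much 100 games versus roughly 1100 games can tell you, here is a sketch of a 95% error margin on an Elo difference. It uses a plain normal approximation on the per-game score, which is my own simplification and not necessarily what the rating program behind the chart does; the win/draw/loss counts are invented for illustration.

```python
import math
from typing import Tuple

def score_to_elo(p: float) -> float:
    """Elo difference implied by a score fraction p (0 < p < 1)."""
    return -400.0 * math.log10(1.0 / p - 1.0)

def elo_interval(wins: int, draws: int, losses: int,
                 z: float = 1.96) -> Tuple[float, float, float]:
    """Return (low, estimate, high) Elo difference for a match result.

    Normal approximation: the standard error of the mean score is computed
    from the observed win/draw/loss spread, then the score bounds are
    converted to Elo. Only valid while both bounds stay inside (0, 1).
    """
    n = wins + draws + losses
    p = (wins + 0.5 * draws) / n
    variance = (wins * (1.0 - p) ** 2
                + draws * (0.5 - p) ** 2
                + losses * (0.0 - p) ** 2) / n
    se = math.sqrt(variance / n)
    return score_to_elo(p - z * se), score_to_elo(p), score_to_elo(p + z * se)

# Hypothetical results with the same 60% score at two sample sizes.
low_100, est_100, high_100 = elo_interval(40, 40, 20)        # 100 games
low_1100, est_1100, high_1100 = elo_interval(440, 440, 220)  # 1100 games
width_100 = high_100 - low_100
width_1100 = high_1100 - low_1100
```

The interval shrinks with the square root of the number of games, so eleven 100-game matches pooled together give a margin a bit more than three times tighter than any single match. Whether that is tight enough to separate variants only a few Elo apart is exactly the question raised above.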

Hello mindbreaker

After all of your pioneering work in setting up tests, I wonder why you don't post the parameters. Then we could try to reproduce, or even benefit from, your work.

I am also surprised that nobody seems to be curious what the settings are.

Would you share some of your best settings with us?

Kind regards, Clemens

http://rybkaforum.net/cgi-bin/rybkaforum/topic_show.pl?pid=277108#pid277108

*"And at the end at 1200 games per engine or less if I end it sooner, I plan to post the parameters and final results."*
Exp 24 has been posted; it is pretty decent according to my tests: http://rybkaforum.net/cgi-bin/rybkaforum/topic_show.pl?pid=275286;hl=exp

And I posted some earlier ones too.

The whole collection will be posted soon.

As I have reached 1150 games and will likely do more than the 1200, I thought I should go ahead and post all the parameters even though it is not quite complete. So here it is...attached.

Attachment: MindbreakerR4Experiments.xls (29k)

Hello mindbreaker

Thank you for your parameter file. I will try some of them and report back here.

Have a nice day

Clemens
