The Rybka Lounge / Computer Chess / L3 cache, RAM and other performance factors
- - By Nimzy (*) Date 2016-12-04 15:21
As has already been documented, for infinite analysis:
1) Doubling the number of cores gives on average about 70% extra (usually) - see the sketch after this list
2) Scaling with the amount of RAM is not well documented, but perhaps almost linear? I.e. 10 times as much is 10 times faster if the analysis time is really long? Maybe a formula based on different speeds and probabilities can be devised?
3) Overclocking a CPU gives a linear return on the increased speed (of course)
4) After this it seems that L3 cache might be the most important factor - any idea how big the difference is? Maybe a formula can be devised based on the number of cache hits/misses... The range among standard CPUs, 4 MB to 20 MB, is quite large...
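
For reference, a minimal sketch (plain C++, illustrative only) of what a flat 70%-per-doubling rule of thumb from point 1 would imply; the 1.7 factor is the rule of thumb itself, not a measurement:

#include <cmath>
#include <cstdio>

int main() {
    // Assumed rule of thumb: each doubling of cores gives ~1.7x effective speed.
    const double gain_per_doubling = 1.7;
    for (int cores = 1; cores <= 32; cores *= 2) {
        double speedup = std::pow(gain_per_doubling, std::log2(double(cores)));
        std::printf("%2d cores -> ~%.1fx effective speed\n", cores, speedup);
    }
    return 0;
}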

Also I suppose other factors such as uncore speed and RAM speed matter, but only with a very small impact?

Any input about any of the factors above is appreciated :)
Parent - - By Antares (****) Date 2016-12-04 18:56
Nimzy-buddy, how are the chicks over at IKEA-land? Regarding...

1) Depends on the engine, and obviously, with every doubling of cores [especially when doubling processors or even whole systems] it degrades more and more due to synchronization overhead... for single-processor computers, though, 70% is a solid rule of thumb.

2) Maybe if you dunk it in snake oil... given a standard size (4 GB) and speed (1600 MHz), doubling RAM (= hash) size or speed will give you very modest [maybe high single-digit percent] gains.

3) Not linear, but close to it.

4) Correct, the speed & size of the whole cache hierarchy is in general way more important than 2), as every cache miss is costly... and since it obviously depends on how the engine organizes its calculation (optimized for cache locality = fewer misses), expect gains above those in 2) but closer to 2) than to linear.

= Buy the fastest many-core CPU you can afford, >=16 GB RAM, overclock reasonably and have fun.
Parent - - By Sesse (****) Date 2016-12-04 23:38
Does the L3 size matter much? One would imagine the hash table is near 100% cache misses anyway (the difference between 99.9% and 99.8% misses is basically nil), and the rest of the stuff should fit quite comfortably into L2? Well, perhaps the pawn hash…

/* Steinar */
Parent - - By Vegan (****) Date 2016-12-05 00:56
L1 caches are the fastest and L2 was added to deal with slower memory speeds
L3 was added to help with multicore processors

Adding more L3 has diminishing returns which is why modern CPU designers have added graphics instead of more L3 etc
Parent - - By Nimzy (*) Date 2016-12-05 06:54
Thanks, so you don't think there is any practical difference between the 35 MB of the Xeon E5-2680 v4 (or the 15 MB of the 6800K) and the standard 6600K with 6 MB? And what about 128 MB of L4 cache vs. no L4 cache?
Parent - By Vegan (****) Date 2016-12-05 07:27
Server CPUs have more L3 mostly to speed up the hypervisor.

L4 is really embedded DRAM, which is what some system-on-a-chip designs use.
Parent - By Sesse (****) Date 2016-12-05 08:27
The L4 cache is mainly there for the embedded GPU, which is why it's only on the Iris Pro. The iGPU shares the memory subsystem (from the L3 on down, IIRC) with the CPU, which means that the main memory bus rather quickly becomes a bottleneck (GPUs can tolerate latency much better than a CPU can, but can also initiate tons more memory transfers). The L4 cache is there to take some of that load off, but it doesn't help much with latency, so it's not that useful for the CPU.

/* Steinar */
Parent - By Sesse (****) Date 2016-12-05 08:23
You're confused (and that statement can safely be put in the L1 cache).

/* Steinar */
Parent - - By Antares (****) Date 2016-12-05 09:09
No, in general it doesn't matter much, but relatively a bit more than RAM size & speed. Back in the days of Rybka 4/Houdini 1.5/RobboLito X.Y I tested a 2600K (8 MB) @ 4 GHz (flat, no turbo) vs. a 2500K (6 MB) @ 4 GHz (flat, no turbo), with the bigger one [with its 2 MB of extra L3 cache] being ~5% faster for infinite analysis (so we're talking about low double digits when hypothetically doubling from 6 MB to 12 MB). Obviously this is highly dependent on how the application (engine) is optimized to make use of the CPU's cache and memory hierarchy.

With engines' data structures optimized for [cache-]memory locality, and with the nowadays pretty intelligent cache[-prefetch] algorithms inside the CPU, costly cache misses are actually pretty rare compared to hits.
Parent - - By Sesse (****) Date 2016-12-05 09:28
In chess? No, cache misses are the norm for the hash table, and always will be. However, there are of course a lot of references to other stuff (the current position, for instance). A quick demonstration below, showing 60%+ cache misses in a simple one-core search:
pannekake:~/nmu/stockfish-grpc> sudo perf stat -e cache-misses,cache-references ./src/stockfish
Stockfish 201116 64 BMI2 by T. Romstad, M. Costalba, J. Kiiski, G. Linscott
setoption name Hash value 4096
go depth 20
info depth 1 seldepth 1 multipv 1 score cp 90 nodes 20 nps 625 tbhits 0 time 32 pv e2e4
[more stuff]
info depth 20 seldepth 29 multipv 1 score cp 40 nodes 3927830 nps 346461 hashfull 6 tbhits 0 time 11337 pv e2e4 e7e5 g1f3 g8f6 f3e5 d7d6 e5f3 f6e4 d2d4 d6d5 f1d3 b8c6 e1g1 f8e7 c2c4 c6b4 c4d5 b4d3 d1d3 e4d6 f1e1 e8g8
bestmove e2e4 ponder e7e5
quit

 Performance counter stats for './src/stockfish':

       184.514.495      cache-misses              #   38,959 % of all cache refs    
       473.608.408      cache-references                                            

      19,800812790 seconds time elapsed
Parent - - By Antares (****) Date 2016-12-05 10:34
I was referring to your statement above that "One would imagine the hash table is near 100% cache misses anyway", which is obviously far off the 38,959% [for a worst case?! starting position] you actually measured (aren't the 60%+ the cache hits?!) - for the data structures of the engine as a whole, sure... as you pointed out, additional homework would be needed to separate the cache hits of the whole engine from the hash-table-only ones. :smile:

Can you please run a higher-depth search of the following position, so we can get a first idea of the engine-startup and starting-position overhead:

setoption name Clear Hash
position fen 1kbr3r/2p2q2/np1p1bpp/1NnPp3/p1P1P3/P3B3/KPQNBPP1/3R3R w - -


Thanks.
Parent - - By Sesse (****) Date 2016-12-05 11:36
You missed the point. :-) Sure, there are 60% cache hits, but those are all about things that are not the hash table. Do note that once you've pulled the cache line in, you have three hash entries in it (for Stockfish), so the remaining accesses, including the final store of the position, will of course be counted as “hits”. (You won't really see this effect clearly in a profile unless you unroll the probing loop by ClusterSize.) 60% is a really low hit rate for a regular program, which means there's going to be tons of memory traffic. Essentially it means that the L3 cache isn't doing much good; the hash table is just too big and too randomly accessed.
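
For anyone following along in the source, here is a simplified sketch of the Stockfish-8-style layout that gives three entries per probed cache line (the field names follow the general shape of tt.h, but treat the details as approximate; the real file wraps these in accessors):

#include <cstdint>

// Simplified, Stockfish-8-style transposition table layout (sketch only).
struct TTEntry {
    uint16_t key16;     // upper bits of the Zobrist key, used for verification
    uint16_t move16;    // best move found at this node
    int16_t  value16;   // search score
    int16_t  eval16;    // static evaluation
    uint8_t  genBound8; // table generation + bound type (exact/upper/lower)
    int8_t   depth8;    // depth of the stored search result
};                      // 10 bytes

struct Cluster {
    TTEntry entry[3];   // one probe pulls in a whole cluster...
    char    padding[2]; // ...so the 2nd/3rd entries are "free" once the line is cached
};                      // 32 bytes, i.e. two clusters per 64-byte cache line

static_assert(sizeof(TTEntry) == 10, "entry should stay at 10 bytes");
static_assert(sizeof(Cluster) == 32, "cluster should fill half a cache line");

int main() { return 0; }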

There's no extra overhead in the starting position (although of course there's a meaningful startup cost for e.g. zeroing out the table at the start of the search), so trying a different position is meaningless.

/* Steinar */
Parent - - By Nimzy (*) Date 2016-12-06 07:26
Thanks for your input :)
So how many % would you say is representative? :) Do I think correctly that the hash table becomes more and more important for each added ply, and that after a very, very long time it is very important to store as much information as possible? Is there somebody who has tried storing it on an SSD, or is that not reasonable after, say, 12 hours of calculation?
Parent - - By Sesse (****) Date 2016-12-06 08:41
I'm not sure if I understand your question. Representative of what?

As for the ideal hash table size, I believe this is generally an open question. Storing it on SSD sounds extreme, though.

/* Steinar */
Parent - - By Nimzy (*) Date 2016-12-06 16:00
Sorry, I meant the typical share of cache hits. Is the 38,959% given above reasonable during a long analysis session? Maybe the question is a bit weird, since I guess it depends on the amount of RAM, but let's say the amount is infinite (which simulates a very short analysis time).
Parent - - By Sesse (****) Date 2016-12-06 20:30
It's reasonable for chess, but for a regular program it would be pretty bad. Just think of rough numbers: say you have a memory access every fifth instruction, and an L3 miss costs 200 cycles. Then every tenth instruction (since you have roughly a 50% hit rate) you get a 200-cycle delay, which means you get to do only one instruction every 20 cycles! You could just as well clock your CPU down from 2 GHz to 100 MHz without losing anything.

(Chess is different because you have much more work between each memory access. It's very ALU heavy with all the bitboard operations.)
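
The same back-of-envelope numbers as a runnable toy (all three constants are the illustrative figures from the paragraph above, not measurements):

#include <cstdio>

int main() {
    // Illustrative figures from the post above, not measurements.
    const double mem_access_per_instr = 1.0 / 5.0; // one memory access every 5th instruction
    const double l3_miss_rate         = 0.5;       // ~50% of cache references miss
    const double miss_penalty_cycles  = 200.0;     // cost of going out to main memory

    // Average stall per instruction (base execution cost neglected, as in the post).
    double cycles_per_instr = mem_access_per_instr * l3_miss_rate * miss_penalty_cycles;

    std::printf("~%.0f cycles per instruction on average\n", cycles_per_instr);
    std::printf("a 2 GHz core then does useful work like a ~%.0f MHz one\n",
                2000.0 / cycles_per_instr);
    return 0;
}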

/* Steinar */
Parent - - By Nimzy (*) Date 2016-12-07 06:44
I see!

And I also see that I messed up half of my question: RAM of course has nothing to do with cache hits, only hash table hits. Do you have an idea how often a hit in the hash table occurs? And how much is the time gain compared to re-evaluating the position?
Parent - By Sesse (****) Date 2016-12-07 08:38
I don't know exactly how often, and it's going to vary a lot with different factors (in particular, more important hash table entries will be found more often), but the gain is huge. Not the least because the hash table contains information that helps with move ordering, and a good move ordering is paramount for both alpha-beta itself and things like late move reductions.

/* Steinar */
Parent - - By Antares (****) Date 2016-12-07 09:17
Nimzy-Buddy, are all the girls in Sweden as fond of partying as this? :grin:



> And how much is the time gain compared to re-evaluating the position?


The gain can be [theoretically near] infinite! If you use an engine that provides a persistent hash feature (which should let you save the hash table to disk and resume the analysis later; funny side note: with your virtual-memory-on-SSD idea you would just have to shut down "the virtual" to have it already saved on [and reloadable from] disk :grin: - given that the engine adds no extra header bytes, of course), and you analyze to depth 43[, "save it"] and restart, you will see that the engine quickly jumps back to depth 43, which saves you recalculating the first 42[+] depth levels... which, by the way, on average is about the time needed to calculate depth 43, or half of depth 44, or a quarter of depth 45...

> Do you have an idea how often a hit in the hash table occurs?


This is [obviously] highly dependent on the engine, the size of the hash table and, even more, on the position itself (a famous position for a high hash hit rate is Fine's #70). For the pawn hash, which is only a few KB in size, you can expect rates of 95% or even 99.x%; for the general hash table we are talking more about low double digits... just consider that a modern single-processor computer calculates ~20,000,000 nodes/sec, and with a few bytes needed to store each node in hash [I have taken 10 bytes/node for simplification], you basically have only ~100,000,000 "seats" available per GB of RAM... so expect a high turnaround rate.
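
A quick sketch of that turnaround, using the same ballpark figures (20M nodes/s, ~10 bytes per entry, 8 GB hash; all assumed, not measured):

#include <cstdio>

int main() {
    // Ballpark figures only; real store rates and entry sizes vary by engine.
    const double nodes_per_second = 20e6;                      // modern desktop, all cores
    const double bytes_per_entry  = 10.0;                      // rough cost per stored node
    const double hash_bytes       = 8.0 * 1024 * 1024 * 1024;  // 8 GB hash

    double entries         = hash_bytes / bytes_per_entry;
    double seconds_to_fill = entries / nodes_per_second;

    std::printf("~%.0f million \"seats\" in the table\n", entries / 1e6);
    std::printf("completely turned over roughly every %.0f seconds if every node were stored\n",
                seconds_to_fill);
    return 0;
}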
Parent - - By Nimzy (*) Date 2016-12-07 20:32
You have to visit and see for yourself! Though I guess it depends on whether it's really warm or cold.

I actually did a test run now with 8 GB vs 1 GB hash size, and according to Stockfish 8, after 10 h I got a speedup of at most 1%. Granted, I didn't have large pages, but still - it doesn't seem like much :)
But that's just the nps; maybe that's not an appropriate measure?
In the given position (I know, one position is way too little), I got to depth 65 in 38695 s with 8 GB but 49563 s with 1 GB. So maybe the speedup is more like 25%? Which of those two is the better way to measure strength (I guess actually playing games is the best)?

"the first 42[+] depth levels... which btw. in average is around the time needed to calculate you depth 43"

> That's not really what I'm getting for Stockfish (for a few differet positions), it's much closer to time between 41 & 42 = half of time between 42 & 43? Are you sure about this?

Parent - - By Antares (****) Date 2016-12-07 21:54


The problem with testing chess engines reaching a certain depth level on modern multicore processors and systems is their non-deterministic execution behaviour: every run looks different, and a factor of >=2x in reaching a certain level and/or researching another primary move is always possible and not reproducible in that exact run (the best you can do is restart the engine and/or clear the hash, and pin engine threads to fixed cores (task manager...)). For a clearer picture, I would advise selecting a much lower target depth (mid-thirties or low forties), doing a lot of runs for a single position and doing this for several other positions as well... the average will tell you more. Another possibility is searching single-core only [again with the engine bound to a fixed core].

> I guess actually playing games is the best


If that's your aim, yes, but if infinite analysis is your aim, playing [blitz] games is Mickey Mouse stuff... given correspondence timeframes, playchess games are simply full of mistakes and worthless for serious analysis (beyond opening research, of course).

In 2011, chess engines (Rybka, Houdini, RobboLito, HIARCS...) needed about as long to calculate depth 'n' as they had needed to reach depth 'n-1' in the first place... with heavily pruning engines like Stockfish in '16 this is indeed often not the case, yet there can be instances of several "+" or "-" levels which can exponentially increase the time needed at any given depth level.
Parent - - By Nimzy (*) Date 2016-12-09 07:15
Is that from a soccer game?

But do you think depth is a better measure than nps? How do they measure Elo gains in fishtest?

I also noticed that the nps tends to increase by a percent or so for each ply - any idea why?
Parent - - By Antares (****) Date 2016-12-09 08:54
Nimzy-Buddy,



> Is that from a soccer game?


No, just a few snapshots out of my dreams...

> How do they measure elo gains in the fishtest? But you think depth is a better measure than nps?


They test it by playing a lot of [bullet] games against the old Stockfish. You can do this as well, playing one engine with 8 GB hash against a copy of it with 1 GB at any time control you desire... obviously, this primarily tells you something about game play (a few [milli]seconds or minutes per move) and not too much about infinite analysis (several hours to days per move), where somewhat different rules apply. With most engines, generating higher nps means higher depths are reached, on average[!], in less time - obviously you need to test a few positions a lot of times and calculate an average to see this... there are exceptions with engines that use additional resources (threads) to widen their tree search with regard to alternatives (Komodo [did this in former versions]).

> I also noticed that the nps tends to increase for each ply by a percent or so, any idea why?


Fewer pieces on the board as the search goes deeper means easier calculation, less hash turnaround and/or fewer tb-hits, less startup/organization overhead, your CPU muscles get warm... :grin:

Obviously we're waiting for your test results - you're our man!
Parent - - By Nimzy (*) Date 2016-12-10 14:50
You seem to have a really vivid imagination!

I'm doing the testing right now :), but even without too many test runs it does seem that the longer you run, the better larger hash tables are (which seems logical). But it will take a few weeks to get an accurate result for very long analysis times :) How much "Elo strength" do you usually consider is added per unit of depth (or per doubling of analysis time)? I started reading http://web.ist.utl.pt/diogo.ferreira/papers/ferreira13impact.pdf. They got a result of 66 Elo points per unit of depth, although that was only until depth 20.

Also, I looked at the source code of Stockfish (mainly tt.cpp and tt.h), and to me the padding they do to align to the cache-line size seems like something that could be optimized further :). But that's just a theoretical point; it might be hard to actually do anything useful... I'm also thinking of changing the memory allocation to allow differently sized hash tables (not just powers of 2). Probably a bad idea, but why not test it :)
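
On the non-power-of-two idea: Stockfish 8 itself masks the key with a power-of-two cluster count, but one well-known alternative is the "multiply-high" mapping, which allows any cluster count without a slow modulo. A minimal sketch (the 400-million-cluster figure is just an example):

#include <cstdint>
#include <cstdio>

// Map a 64-bit Zobrist key onto [0, num_clusters) without requiring num_clusters
// to be a power of two. This "multiply-high" mapping needs a 128-bit product
// (GCC/Clang __int128 is used here).
static uint64_t cluster_index(uint64_t key, uint64_t num_clusters) {
    return static_cast<uint64_t>(
        (static_cast<unsigned __int128>(key) * num_clusters) >> 64);
}

int main() {
    const uint64_t num_clusters = 400000000ULL;          // e.g. ~12.8 GB of 32-byte clusters
    const uint64_t some_key     = 0x9D39247E33776D41ULL; // an arbitrary 64-bit key
    std::printf("key maps to cluster %llu of %llu\n",
                (unsigned long long)cluster_index(some_key, num_clusters),
                (unsigned long long)num_clusters);
    return 0;
}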
Parent - By Antares (****) Date 2016-12-10 15:27
Nimzy-buddy,

can you please give every girl [<= age of 25] you meet till Christmas my nick here @rybkaforum? I want them to pm me, so they can test my engine!



> it does seem that the longer you run the better larger hash tables are (which seems logical)


Obviously that's the case; the more practical question is: where is your personal break-even point regarding price vs. performance gain? And always remember: fewer populated RAM slots can often be better timed and/or overclocked. The same question applies to the CPU/cores as well...

> They got a result of 66 elo points per unit of depth, although that was simply until depth 20.


Just remember that the depth levels of different engines are not really comparable; I would always suggest calculation time as the fairest and most comparable factor.

There are some very intelligent people behind Stockfish, but obviously, since it is "optimized" to be very portable, there is always some room for extra speed (asmFish, which may be buggy for some?!)... here too there is a development/testing-time vs. performance trade-off... in life you always have to give something to gain something else... I would give my PM and a decent check donated in your name for a contact list of some cute Swedish girls... :grin:

> And also I'm thinking of changing the memory allocation to allow differently sized Hash tables (not just a power of 2)


Yes, with 16 GB of RAM you could then easily try a 12 GB hash.

Fare well and test well, buddy.
Parent - - By Antares (****) Date 2016-12-16 22:20
Nimzy-buddy,



please[!] tell us about your test experiences and results.



Best regards,
Antonius, a [high-potent!] loova of Swedish chicks&girls.

Parent - - By Nimzy (*) Date 2016-12-16 22:43
Wow, you must have a vivid imagination ;)

Well, no conclusion yet, testing is still ongoing, but I did notice that sometimes a ply may take about 2x the sum of all previous plies, and sometimes it takes about a tenth of that. Do you know how this is decided in, for example, Stockfish?
Parent - - By Antares (****) Date 2016-12-18 11:22
Nimzy-buddy,



as I said above, "2x the sum of all previous plies for a ply" is/was the norm (as the amount of calculation needed grows roughly geometrically with depth); nowadays, thanks to aggressive pruning and other "intelligent tree work", engines find shorter ways to deeper levels. And due to the random nature of multithreaded processing, every run looks a little different...
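
A toy model of why (the branching factors below are illustrative assumptions, not measurements): if every ply costs b times the previous one, a new ply costs about (b - 1) times the sum of all previous plies, so b ≈ 3 matches the "2x the sum" case, b ≈ 2 the old "same as the sum" rule, and heavy pruning pushes it lower:

#include <cstdio>

int main() {
    // Assume the cost of ply d is b^d (arbitrary units) for a constant
    // effective branching factor b; then ply cost / sum of previous plies -> b - 1.
    const double branching_factors[] = {3.0, 2.0, 1.5}; // illustrative values only
    for (double b : branching_factors) {
        double previous_total = 0.0, ply_cost = 1.0;
        for (int d = 1; d <= 6; ++d) { previous_total += ply_cost; ply_cost *= b; }
        std::printf("b = %.1f: ply 7 costs %.2fx the sum of plies 1-6\n",
                    b, ply_cost / previous_total);
    }
    return 0;
}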
Parent - - By Nimzy (*) Date 2016-12-27 20:23
Wow, are you googling them or something? ;)

I know you said it before :) This analysis is really taking a lot of time though :)
Extensive testing of multiple positions clearly shows that for 3-day analysis the gain from using a lot more RAM (10x) is maybe about 20% (very rough estimate). So very close to your estimate? But I still think that for longer analysis, such as 15+ days, it would *perhaps* mean at least 100%? And for 100+ days of analysis maybe even 10x faster. Do you agree with this theoretical analysis? :) And due to diminishing returns, 100 days of analysis would be 5 Elo points better than 10 days ;). I couldn't find any serious work about diminishing returns above depths of 20-30. Which is strange - wouldn't somebody have looked at it?

Some other questions have popped up though :) I was looking at https://sites.google.com/site/computerschess/stockfish-chess-benchmarks and if we compare:
45773  36   x64   2x Xeon E5-2696 v3     @2.80GHz    Yorkman (the top one)
with for example the standard:
13797   4   x64   Intel Core i5-6600K    @5.00GHz    Dark_Wizzie
Wouldn't the difference actually be much smaller due to the inefficiency of parallel alpha-beta search? According to another thread, "Houdart believes that search inefficiencies with doubling of threads are about 20 percent" (and I think this is in line with other discussions about hyperthreading, where they want 30%+ nps to actually turn it on). Maybe the value is slightly different since the introduction of Lazy SMP in Stockfish, but I can't find an actual estimate of it anywhere...
The comparison for actual use would be: 0.8*0.8*0.8*45773=23435
So isn't the comparison skewed, or am I missing something?

And a third question :)
I understand that asmFish is faster since it is written in assembler; Cfish was also faster earlier due to NUMA support, but now that standard Stockfish has it, it isn't(?). But why are the compiles at http://chess.ultimaiq.net/stockfish.html 8.5% faster than the abrok Stockfish compiles? Of course that may happen if the standard compiles aren't using the fastest C++ standard libraries, but why wouldn't the Stockfish team use them?

I was also thinking about the importance of avoiding hash collisions at greater depths, which may mean that the current hash value size for Stockfish is not optimal at longer time controls. I don't know whether somebody else has investigated this lately? But that would require some major analysis work (either theoretical or practical)...
Merry Christmas! :D
Parent - - By Antares (****) Date 2016-12-30 12:51
Dear Nimzy-buddy,

how far are you away from Stockholm?



> Extensive testing of multiple positions clearly shows that for 3-day analysis the gain from using a lot more RAM (10x) is maybe about 20% (very rough estimate). So very close to your estimate?


Thanks for your report! So you see, this is far off what "our" Taiwanese friend claimed & tested, and indeed very close to my estimate. :lol:

> But I still think that for longer analysis, such as 15+ days, it would *perhaps* mean at least 100%?


NEVER, not even on a cold frozen day in hell, would it mean that. Did you take other measurements at a few hours, half a day, one and two days? I expect pretty much no real[-life] change beyond that 20% after three days (remember, every run looks a little different...). Beyond that, having played (and won most of, not losing a single one...) a few correspondence games myself, I never had the impression that analysis beyond three days, without manual interaction, makes sense at all (I have written something about analysis above already).

> And due to diminishing returns, 100 days of analysis would be 5 Elo points better than 10 days ;). I couldn't find any serious work about diminishing returns above depths of 20-30. Which is strange - wouldn't somebody have looked at it?


Simply put: someone running analysis for a few days is indeed interested in serious analysis and willing to manually guide the path... this can more or less be worth a few hundred extra Elo if you will (the current ICCF world champion, dragonmist, wrote about that in his ChessBase articles)... simply because most often you can outplay an unassisted engine if you want and need to!

> So isn't the comparison skewed, or am I missing something?


Aside from the fact that I would be careful with "Dark_Wizzie" results :lol:, it's more like 13797 * 1.6 [doubling to 8 cores, i.e. minus 20%] * 1.6 [to 16] * 1.6 [to 32] ≈ 56513 (plus a little more for the remaining 4 cores up to 36), which is pretty reasonable given that a dual socket is much more complex and the Xeons are clocked much lower.

Maybe the guys/dragonmist can answer your question regarding Stockfish compiles.

> that the current hash value size for Stockfish is not optimal at longer time controls.


Indeed, most engines, most notably Houdart's Houdini in earlier versions, are/were optimized for quick engine games, and there can/could be some useful tweaks for serious analysis.

Fare well, buddy!
Parent - - By Nimzy (*) Date 2017-01-05 18:58
Are you coming to Sweden? :) It is cold now though...

> Did you take other measurements at a few hours, half a day, one and two days? I expect pretty much no real[-life] change beyond that 20% after three days (remember, every run looks a little different...). Beyond that, having played (and won most of, not losing a single one...) a few correspondence games myself, I never had the impression that analysis beyond three days, without manual interaction, makes sense at all (I have written something about analysis above already).


Well, I agree with you on the real-life point, but I still think that from a theoretical viewpoint each entry in the hash table is worth a lot, approaching infinity as the analysis time approaches infinity. From a theoretical viewpoint, that is; from a practical viewpoint you are of course right.

I'm thinking of testing diminishing returns by comparing the difference between d=19 and d=20 vs. the difference between d=29 and d=30. My calculation says that a few hundred games would take a few weeks, so not right now :)

> Aside from the fact that I would be careful with "Dark_Wizzie" results :lol:, it's more like 13797 * 1.6 [doubling to 8 cores, i.e. minus 20%] * 1.6 [to 16] * 1.6 [to 32] ≈ 56513 (plus a little more for the remaining 4 cores up to 36), which is pretty reasonable given that a dual socket is much more complex and the Xeons are clocked much lower.


To be honest, I'm not sure about your calculation. The loss for each doubling of cores is 20%, so to compensate for that (to be able to compare them), 4 cores @ X nps would be equal in strength to 8 cores @ 1.25 X nps. Which is what I wrote (just comparing in the other direction). Multiplying by 2*0.8 (as you did) only makes sense when you actually double the cores, not when normalizing for a comparison. I hope I don't offend you - that is not my intention! :)

I've been reading up on Lazy SMP and realized that depth testing with different numbers of cores is not good either (since with more cores the search tree gets wider than normal). So matches at fixed time controls (like humans play) seem to be the way to go to compare different aspects, but with 500 games needed to get statistically solid values it will take ages :) One conclusion from all this testing is at least that AWS EC2 is cheap, but cheap * a lot of time = not so cheap :)
Parent - - By Dark_wizzie (***) Date 2017-01-27 10:26
This is actually pretty off-topic on more than one level, but I've already submitted my Kaby Lake bench (7600K) at Sedat's site. Note that I had a lucky run on my 6600K and its score is a bit higher than it probably should have been.
Parent - - By Nimzy (*) Date 2017-01-28 10:30 Edited 2017-01-28 10:39
Haha, maybe off-topic, but the name of the thread is "and other performance factors", and clearly both RAM and L3 cache weren't as important as I thought at the beginning.

Thanks for the benchmark, and good job on the high overclock (is it air or liquid? What motherboard and RAM are you using?)! But the IPC of Kaby Lake really seems bad? At least for chess: 13797/5.0 GHz = 2759.4, while 14102/5.35 GHz = 2635.9. So IPC is not better for Kaby Lake at all? It almost looks much worse, but maybe that is just due to variation in the testing method? How much higher do you think the i5-6600K result is than it should have been? If we extrapolate the result for the 3.70 GHz i5-6600K we get 9625*5/3.7 = 13006. And even if that were the correct result, 13006/5 = 2601, which is about 1.3% slower than the 7600K.

But maybe the 7600K is easier to overclock - what do you think? Are you getting a Ryzen?
Parent - - By Labyrinth (*****) Date 2017-01-28 12:06
There is no IPC difference between Kaby lake and Skylake. Kaby just features higher clocks due to a slightly different process.
Parent - - By Nimzy (*) Date 2017-01-28 15:14
True, I had read somewhere that it was like 1-5%, but I guess that wasn't true. Looking at it now, the improvements over Sandy Bridge seem very slim: from January 2011 to 6 years later (and 5 generations), about 15-20% in total for general applications.

For chess it seems higher than that, probably because of the BMI and SSE instruction sets? I've been looking at AVX-512, soon to be launched (next generation), but I can't figure out whether that will yield an improvement...
Parent - - By Dark_wizzie (***) Date 2017-01-29 09:18 Edited 2017-01-29 09:22
Hello,
Performance per clock is a general metric that can vary depending on what we're testing. In my limited experience just benching Haswell vs Skylake, the differences in IPC are a bit on the lower side compared to some of my other applications. Also, IIRC, the speedup from BMI2 is something like 1-3%. Of course, BMI2 worked on Haswell too. It is true that Kaby Lake features no IPC improvement over Skylake; it is a refinement in process that results in higher clocks (along with misc features in the associated chipset that don't really affect performance). Recall that Devil's Canyon had the same IPC as Haswell, it just clocked better. Since Sandy, the improvement with each generation has decreased a lot. From Sandy to Haswell was the worst, I think: an average Haswell was barely faster than an average Sandy due to poor overclocking, but in this case we have instruction sets to think about as well.

More specifically, on the discrepancy in my Skylake bench result: as I mentioned previously, I got lucky with my Skylake bench. In that run the engine decided on a different move, and that caused it to think longer than it normally would, but Sedat accepted it anyway. All of my future benches will not be like this, for consistency. My original mentality when submitting it was to grab the highest possible result no matter what. The side effect is that it made Kaby Lake look worse than it really is. I suggest you take my Kaby Lake result, divide it by the correct number so you know the kn/s @ 5 GHz, and assume that was my normal bench on Skylake.

As for my system: I bought a binned and delidded chip from Silicon Lottery. RAM is 2x8 GB at 3804 MHz, 15-15-15-32. The idea of cache and performance was one I was pondering as well, since one difference between the 7700K and the 7600K is more cache. In the end I decided the difference is too small to justify the huge price difference with binned chips. I am running air-cooled with a D14. When it comes to Ryzen I have to say: these days I do not do a lot of chess. My main priority is gaming, and most of my games are CPU-limited because they use only a few cores. This is why consumer Kaby Lake is the fastest CPU for my purposes, and also why Ryzen does not fit my use case. That said, I expect Ryzen to be a great deal for chess fans. Even the 8350 held out for a very long time with good price/performance.

Refer to my overclocking charts for reference on how high these chips clock on average:
https://docs.google.com/spreadsheets/d/1wQwYMGsSnMpKxrEesNVUSoP7hGykFWw1ygJsxwx64e8/edit?usp=sharing
https://docs.google.com/spreadsheets/d/1NoxceLMU9dnVev8QmYmBT16fnjwrGkwdIRjTQzzKaVk/edit?usp=sharing

On SSDs, my own testing showed that my 950 Pro did not perform significantly better than my 850 Pro when it comes to kn/s with heavy TB hits. I think my testing was done correctly, but it may not have been. Too bad trace-based analysis is out of reach for the common man...
Parent - - By Nimzy (*) Date 2017-02-02 17:36
Thanks for the explanations! Wow - that is some serious data in those files!
So the averages there were 5.05 GHz for Kaby and 4.70 GHz for Skylake. That may seem convincing, but since the sample sizes are so different (14 vs 142), maybe only the "best" overclockers have gotten a Kaby Lake yet, and that is why the result is higher? Or coolers are getting better? But anyway - that is probably not the case, and it probably wouldn't explain much of the difference anyway :)
Parent - By Dark_wizzie (***) Date 2017-02-06 09:48
Kaby Lake is new; people haven't had a chance to submit their overclocks yet. From experience and from Silicon Lottery's website I can tell you that the 4.7 vs 5 GHz jump is about right.
Parent - - By AU (**) Date 2017-01-29 07:34

> Wouldn't the difference actually be much smaller due to the inefficiency of parallel alpha-beta search? According to another thread, "Houdart believes that search inefficiencies with doubling of threads are about 20 percent" (and I think this is in line with other discussions about hyperthreading, where they want 30%+ nps to actually turn it on). Maybe the value is slightly different since the introduction of Lazy SMP in Stockfish, but I can't find an actual estimate of it anywhere...


I don't think you actually tried. For Stockfish 8, the link suggests Amdahl's law with, empirically, a 4.5% non-parallelizable part.
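
For anyone who wants to plug numbers into that: Amdahl's law with serial fraction s gives speedup(n) = 1 / (s + (1 - s)/n). A quick sketch using the quoted 4.5% figure (the fraction is the empirical value cited above, not mine):

#include <cstdio>

int main() {
    // Quoted empirical serial fraction for Stockfish 8 (from the post above).
    const double serial_fraction = 0.045;
    const int thread_counts[] = {1, 2, 4, 8, 16, 32, 64};
    for (int n : thread_counts) {
        double speedup = 1.0 / (serial_fraction + (1.0 - serial_fraction) / n);
        std::printf("%2d threads -> %.2fx\n", n, speedup);
    }
    return 0;
}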
Parent - - By Nimzy (*) Date 2017-02-02 17:30
Thanks very much for that link! No, I never tried that combination (and I didn't say I did, did I?). I had actually read it before (although I misunderstood Lucas's comment in http://rybkaforum.net/cgi-bin/rybkaforum/topic_show.pl?tid=31863;pg=1, which was not about those measurements).
But still - comparing 4 to 32 cores, for which I used 0.8*0.8*0.8 = 0.51 as the relative performance factor - with these new Lazy SMP numbers, going from 4 to 32 gives a relative factor of 0.47 (actual speedup / expected "NPS" speedup = (13.36/3.52)/8 = 3.795/8 = 0.47).
That comparison would then yield 0.47*45773 = 21513, which is just 56% higher than the 6600K.
And the fact that the measurement was even on 36 cores makes it even worse? So the playing performance of the 2x Xeon E5-2696 v3 should be about 50% better than the OC'd 6600K? I can't imagine what the price difference is, but maybe a factor of 5 if everything is bought new?
Parent - By Nimzy (*) Date 2017-03-03 19:19
I have been thinking about this and I don't see why this can be extrapolated to different depths.
For deeper analysis, the positions (the leaves) being evaluated by different cores in the Lazy SMP implementation are further apart (i.e. less likely to overlap, since the calculations are independent), I would think? But maybe I just haven't understood the iterative deepening of the alpha-beta algorithm.
Parent - - By Antares (****) Date 2016-12-06 09:26 Edited 2016-12-06 09:29
Nimzy-buddy, i just thought/dreamed about AceOfBase-Jenny... how sweet she was when young, and now, well, she is 44... yeah.



> So how many % would you say is representative? :)


While real-life numbers for an engine's cache misses may be [well] below the quickly "tested" 38,959% above (no cleared hash, [a single] starting position, just 19.x sec of runtime...), you have to expect cache misses to be relatively frequent compared to more standard (more memory-local) applications... but well, what does it matter, what can you do? Right: get the best engine compiles, buy the best [desktop] CPU and reasonably clocked & sized RAM, overclock a little and see what you get... a big and costly server CPU is definitely not worth the price/performance just for its bigger L3 cache (but maybe for its additional cores, if you want to afford that). The same goes for RAM: prefer size over (overclocking) speed; given a reasonable baseline [16 GB, 1600 MHz], doubling both will leave you far short of the performance gains promised by the Taiwanese snake-oiler you mentioned... but if you want to afford it, why not (it's like sports cars...).

> Do I think correctly that the Hash table becomes more and more important for each added ply and after a very very long time


Yes, if you read the experiences of the other users here carefully, you'll see there seems to be a pretty low "perfect" hash size for bullet games, which increases heavily when you aim for [as you said, 12 h of] infinite analysis... yet, from my practical experience analyzing correspondence games, there is a sweet spot for doubling search time or hash size beyond which everything gives very diminishing returns, but you will see for yourself soon. Keep in mind that a modern single processor can calculate/look up ~20,000,000 nodes/sec, so you can basically expect the hash to be completely overwritten every few minutes = there is no practical hash size in which you can store all the calculations of several hours, let alone several days.

> Is there somebody who has tried storing it on SSD


I'm surprised Steinar didn't point this out above, but obviously an SSD is absolutely not suited to such a high number of write cycles... its primary use in chess is for tablebases (a high number of read cycles, [nearly] no write cycles).

And can you please send a personal greeting to Jenbaby? :grin:
Parent - - By Sesse (****) Date 2016-12-06 11:27
The thing about wearing out SSDs is true but greatly overblown. You can pretty much write nonstop to a modern SSD for years before it wears out. For a busy production environment, this certainly matters (you don't like parts that only last 2–3 years), but for a regular user like you or me, I wouldn't worry.

/* Steinar */
Parent - - By Antares (****) Date 2016-12-06 12:30

> The thing about wearing out SSDs is true but greatly overblown.


For normal daily use, yes, but using an SSD as virtual memory for the hash table you can reach the hundreds of terabytes written at which consumer SSDs start to fail within months:



http://techreport.com/review/27909/the-ssd-endurance-experiment-theyre-all-dead
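
A rough sketch of why (store rate and entry size are the assumed ballpark figures from earlier in the thread, not measurements):

#include <cstdio>

int main() {
    // Assumed ballpark figures from earlier in the thread.
    const double nodes_per_second = 20e6;  // hash stores per second, roughly one per node
    const double bytes_per_store  = 10.0;  // rough size of one stored entry

    double bytes_per_day = nodes_per_second * bytes_per_store * 86400.0;
    std::printf("~%.1f TB of hash writes per day\n", bytes_per_day / 1e12);
    // A consumer drive rated for a few hundred TB written would be used up within weeks.
    return 0;
}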

//===> Antares <===
Parent - - By Sesse (****) Date 2016-12-06 12:40
Assuming you can actually write random data that fast, yes. The test you're linking to writes sequentially, which is a lot easier for the drive to sustain.

People don't realize it, but flash does have “seek times” just like spinning drives. Except it's more of a per-request overhead/latency, and on good devices you can get around it by issuing lots of simultaneous requests. On mobile flash, not so much. If you ever use the offline routing functionality in Google Maps (iOS/Android), spare a thought for all the I/O optimization we did to make it feel fast :-)

/* Steinar */
Parent - - By Antares (****) Date 2016-12-06 14:46

> People don't realize it, but flash does have “seek times” just like spinning drives.


Comparing sequential 4 MB write speed with random 4 KB write speed barely shows a speedup of ~1.5, most of it obviously due to the former's chunks being 1024 times bigger... I don't think random writes (as with a hash table) will extend an SSD's lifetime noticeably beyond a low factor of 1.x.





//===> Antares proves the point <===
Parent - - By Sesse (****) Date 2016-12-06 14:58
But this is with several concurrent requests, right?
Parent - - By Antares (****) Date 2016-12-06 15:48
Regarding our discussion about using an SSD as virtual memory for the hash table: it being one to two orders of magnitude slower than real RAM - and with all the cache misses you "proved" above - you realize there will always be some [concurrent] hash table entries waiting to be written, right?

(Not to mention that a sequential 4 MB write is more "concurrent" than a 4 KB one can ever be?)
Parent - - By Sesse (****) Date 2016-12-06 15:57
OK, assuming you're fine with out-of-order writes, which you probably are in chess.
Parent - By Antares (****) Date 2016-12-06 19:21
;-)