1) Doubling the number of cores gives on average 70% extra performance (usually)
2) Scaling with the amount of RAM is not well documented, but perhaps almost linear? I.e. 10 times as much RAM is 10 times faster if the analysis time is really long? Maybe a formula based on different speeds and probabilities can be devised?
3) Overclocking a CPU gives a linear return on the increased clock speed (of course)
4) After this it seems that L3 cache might be the most important factor, any idea of how big the difference is? Maybe a formula can be devised based on the number of cache hits/misses... The range of 4MB-20MB across different standard CPUs is quite large...
Also I suppose other factors such as uncore speed and RAM speed matter, but only with a very small impact?
Any input about any of the factors above is appreciated :)
1) Depends on the engine, and obviously, with every core-doubling [especially with processor-doubling or even system-doubling] it degrades more and more due to synchronization overhead... for single-processor computers, 70% is a solid rule of thumb though.
2) Maybe when you dunk it into snake oil... given a standard size (4GB) and speed (1600MHz), doubling RAM (= hash) size or speed will give you very modest [maybe higher single-digit] gains.
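To make the rule of thumb concrete, here is a small Python sketch; the 70% gain per doubling is the figure quoted above, and the exact value of course depends on engine and hardware:

```python
import math

def rule_of_thumb_speedup(cores, gain_per_doubling=0.70):
    """Each doubling of cores multiplies throughput by ~1.7 (forum rule of thumb)."""
    doublings = math.log2(cores)
    return (1 + gain_per_doubling) ** doublings

for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} cores -> ~{rule_of_thumb_speedup(n):.2f}x")
```

So 16 cores buy you roughly 8.4x a single core under this model, not 16x.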
3) Not linear, but near to it.
4) Correct, the speed & size of the whole cache hierarchy is [in general] way more important than 2), as every cache miss is costly... and as it is obviously dependent on how the engine organizes its calculation (optimized for cache locality = fewer misses), expect gains above but nearer to 2) than linear.
= Buy the fastest CPU with the most cores you can afford, >=16GB RAM, overclock reasonably and have fun.
/* Steinar */
L3 was added to help with multicore processors.
Adding more L3 has diminishing returns, which is why modern CPU designers have added graphics instead of more L3, etc.
L4 is really embedded DRAM, which is what some system-on-a-chip designs use.
/* Steinar */
/* Steinar */
With engine's data-structures being optimized for [cache-]memory-locality and nowadays pretty intelligent cache[-prefetch]-algorithms within the CPU, costly cache-misses actually are pretty rare compared to the -hits.
pannekake:~/nmu/stockfish-grpc> sudo perf stat -e cache-misses,cache-references ./src/stockfish
Stockfish 201116 64 BMI2 by T. Romstad, M. Costalba, J. Kiiski, G. Linscott
setoption name Hash value 4096
go depth 20
info depth 1 seldepth 1 multipv 1 score cp 90 nodes 20 nps 625 tbhits 0 time 32 pv e2e4
[more stuff]
info depth 20 seldepth 29 multipv 1 score cp 40 nodes 3927830 nps 346461 hashfull 6 tbhits 0 time 11337 pv e2e4 e7e5 g1f3 g8f6 f3e5 d7d6 e5f3 f6e4 d2d4 d6d5 f1d3 b8c6 e1g1 f8e7 c2c4 c6b4 c4d5 b4d3 d1d3 e4d6 f1e1 e8g8
bestmove e2e4 ponder e7e5
quit

 Performance counter stats for './src/stockfish':

       184.514.495 cache-misses              # 38,959 % of all cache refs
       473.608.408 cache-references

      19,800812790 seconds time elapsed
Can you please run a higher depth run of the following position, so we can get a first idea of engine-startup&starting-position overhead:
setoption name Clear Hash
There's no extra overhead in the starting position (although of course there's a meaningful startup cost for e.g. zeroing out the table at the start of the search), so trying a different position is meaningless.
/* Steinar */
So how many % would you say is representative? :) Am I correct in thinking that the hash table becomes more and more important for each added ply, and that after a very very long time it is very important to store as much information as possible? Has somebody tried storing it on SSD, or is that not reasonable, after say 12 hours of calculation?
As for the ideal hash table size, I believe this is generally an open question. Storing it on SSD sounds extreme, though.
/* Steinar */
(Chess is different because you have much more work between each memory access. It's very ALU heavy with all the bitboard operations.)
/* Steinar */
And I also see that I messed up half of my question, RAM has of course nothing to do with cache hits, only hash table hits. Do you have an idea how often a hit in the hash table occurs? And how much the gain in time is compared to re-evaluation of position?
/* Steinar */
> And how much the gain in time is compared to re-evaluation of position?
The gain can be [theoretically near] infinite! When you use an engine which provides a persistent hash feature (which should give you the possibility of saving the hash table to hard disk for a later resumption of analysis; funny side note: with your virtual-memory SSD idea you would just have to shut down "the virtual" to have it already saved on [and reloadable from] hard disk - given there are no additional header bytes added by the engine, of course), and you analyze to depth 43[, "save it"] and restart, you will realize that the engine quickly jumps to depth 43, which saves you recalculating the first 42[+] depth levels... which btw. on average is around the time needed to calculate depth 43, or half of depth 44, or a quarter of depth 45...
> Do you have an idea how often a hit in the hash table occurs?
This is [obviously] highly dependent on the engine, the size of the hash table and, even more, on the position itself (a pretty famous position for a high hash hit-rate is Fine's #70). For the pawn hash, which is only a few KB big, you can expect rates of 95% to even 99.x%... for the general hash table we can speak more about lower double digits... just realize that a modern single-processor computer calculates ~20,000,000 nodes/sec, and with a few bytes needed for storing each node in hash, you basically have [I have taken 10 bytes/node for simplification] ~GBofRAM x 100,000,000 seats available... so expect a high turnaround rate.
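The back-of-envelope above can be written out; note the 10 bytes/node and 20 Mnps figures are the simplifications used in this post, not actual engine internals:

```python
def hash_slots(ram_gb, bytes_per_entry=10):
    """Number of hash 'seats' for a given table size (simplified 10 bytes/node)."""
    return ram_gb * 1_000_000_000 // bytes_per_entry

def seconds_until_full(ram_gb, nps=20_000_000, bytes_per_entry=10):
    """Time until the engine has stored as many nodes as the table has slots."""
    return hash_slots(ram_gb, bytes_per_entry) / nps

print(hash_slots(1))          # 100,000,000 slots per GB
print(seconds_until_full(4))  # a 4 GB table sees one store per slot in ~20 s
```

Which is why the turnaround rate is high: the table is overwritten many times per minute.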
I actually did a test run now with 8GB vs 1GB hash size, and according to Stockfish 8, after 10h I got a speedup of at most 1%. Now I didn't have large pages, but still. It doesn't seem like that much :)
But that's just the "nps", maybe that's not an appropriate measure?
In the given position (I know, 1 position is way too little), I got to depth 65 in 38695s for 8GB but 49563s for 1GB. So maybe the speedup is more like 25%? Which is the best way to measure strength? (Of those 2, I guess actually playing games is the best.)
"the first 42[+] depth levels... which btw. in average is around the time needed to calculate you depth 43"
> That's not really what I'm getting for Stockfish (for a few different positions), it's much closer to: time between 41 & 42 = half of time between 42 & 43? Are you sure about this?
The problem with testing chess engines reaching a certain depth level on modern multicore processors & systems is their non-deterministic execution behaviour: Every run looks different, and a factor >=2x for reaching a certain level and/or researching another primary move is always possible and, in this exact run, not reproducible (the best you can do is restart the engine and/or clear hash, and pin engine threads to fixed cores (task manager...)). For a clearer picture, I would advise selecting a much lower target depth (mid 3x or low 4x), taking a lot of runs for a single position and doing this with several other positions as well... the average here will tell you more. Another possibility is searching single-core only [again with the engine bound to a fixed core].
> I guess actually playing games is the best
When it's your aim, yes, but when infinite analysis is your aim, playing [blitz] games is Mickey-Mouse stuff... given corr timeframes, playchess games are simply full of mistakes and worthless for serious analysis (beyond opening research of course).
In 2011, chess engines (Rybka, Houdini, Robbolito, Hiarcs...) needed around the time they needed for reaching up to depth 'n-1' to calculate depth 'n'... with highly pruning engines like Stockfish in '16 this indeed is often not the case, yet there can be instances of several "+" or "-" levels which can exponentially increase the needed time at any given depth level.
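That 2011-style behaviour ("depth n costs about as much as everything before it") corresponds to an effective branching factor of ~2, as this sketch shows; t1 and the EBF here are illustrative numbers, not measurements:

```python
def time_for_ply(d, ebf=2.0, t1=0.001):
    """Cost of searching exactly to depth d, geometric model with branching factor ebf."""
    return t1 * ebf ** d

cost_43 = time_for_ply(43)
cost_1_to_42 = sum(time_for_ply(d) for d in range(1, 43))
# With ebf = 2, 2^43 is (almost exactly) the sum of 2^1 .. 2^42
print(cost_43 / cost_1_to_42)  # -> just above 1
```

With the lower effective branching factors of modern pruning-heavy engines, that ratio drops well below 1, which matches the observation above.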
But you think depth is a better measure than nps? How do they measure elo gains in the fishtest?
I also noticed that the nps tends to increase for each ply by a percent or so, any idea why?
> Is that from a soccer game?
No, just a few snapshots out of my dreams...
> How do they measure elo gains in the fishtest? But you think depth is a better measure than nps?
They test it by playing a lot of [bullet] games against the old Stockfish. You can do this as well, playing one engine with 8GB hash against a copy of it with 1GB at any time control you desire... obviously, this primarily gives you conclusions about game play (a few [milli]seconds or minutes per move) and not too much about infinite analysis (several hours to days per move), where some other rules apply. With most engines, generating higher nps means higher depths are reached, on average[!], in less time - obviously you need to test a few positions a lot of times, calculating an average, to see this... there are exceptions with engines which use additional resources (threads) especially to widen the tree[-search] regarding alternatives (Komodo [did this in former versions]).
> I also noticed that the nps tends to increase for each ply by a percent or so, any idea why?
Fewer pieces on the board in search for easier calculation, less hash turnaround and/or fewer tb-hits, less starting/organization overhead, your CPU muscles get warm...
Obviously we're waiting for your test-results, you're our man!
I'm doing the testing right now :), but without having too many test runs it does seem that the longer you run, the better larger hash tables are (which seems logical). But it will take a few weeks to get an accurate result for very long analysis times :) How much "elo-strength" do you usually consider to be added for each depth (or doubling of analysis time)? I started reading http://web.ist.utl.pt/diogo.ferreira/papers/ferreira13impact.pdf. They got a result of 66 elo points per unit of depth, although that was only up to depth 20.
Also I looked at the source code for Stockfish (mainly tt.cpp and tt.h), and to me the padding they do to align to the cache line size seems like something that could be optimized further :). But that's just from a theoretical point of view, it might be hard to actually do anything useful... And I'm also thinking of changing the memory allocation to allow differently sized hash tables (not just a power of 2). Probably a bad idea, but why not test it :)
can you please give every girl [<= age of 25] you meet till Christmas my nick here @rybkaforum? I want them to pm me, so they can test my engine!
> it does seem that the longer you run the better larger hash tables are (which seems logical)
Obviously that's the case, the more practical question is: Where is your personal breakeven-point regarding price/performance-gain, and always remember: Less RAM-slots populated can often be better timed and/or overclocked. The same question goes for the CPU/cores as well...
> They got a result of 66 elo points per unit of depth, although that was simply until depth 20.
Just remember that depth levels of different engines are not really comparable; I would always suggest calculation time as the most fair & comparable factor.
There are some very intelligent people behind Stockfish, but obviously, as it is "optimized" being very portable, there is always some room for speed (asmFish, which may be buggy for some?!)... also here it is a development/testing-time to performance-tradeoff... in life you always have to give something to gain something else... i would give my pm and a decent check donated to your name for a contact-list of some cute Swedish girlz...
> And also I'm thinking of changing the memory allocation to allow differently sized Hash tables (not just a power of 2)
Yes, with 16GB RAM you can easily try 12GB hash.
Fare & test well, buddy.
please[!] tell us about your test-experiences&-results.
Antonius, a [high-potent!] loova of Swedish chicks&girls.
Well, no conclusion yet, testing is still ongoing, but I did notice that sometimes it may take about 2x the sum of all previous plies for a ply. And sometimes it takes like a tenth. Do you know how this is chosen in, for example, Stockfish?
As I said above, "2x the sum of all previous plies for a ply" is/was the norm (as the amount of needed calculation roughly doubles per ply); nowadays, due to aggressive pruning and other "intelligent tree-work", engines find shorter ways to deeper levels. Due to the random nature of multithreaded processing, every run looks a little different...
I know you said it before :) This analysis is really taking a lot of time though :)
Extensive testing of multiple positions clearly shows that for 3 days of analysis the gain of using a lot more RAM (10x) is maybe about 20% (very rough estimation). So very close to your estimation? But I still think for longer analysis, such as 15 days+, it would *perhaps* mean at least 100%? And for 100 days+ analysis maybe even 10x faster. Do you agree with this theoretical analysis? :) And due to diminishing returns, 100 days of analysis would be 5 elo points better than 10 days ;). I couldn't find any serious work about diminishing returns above depths of 20-30. Which is strange, wouldn't somebody look at it?
Some other questions have popped up though :) I was looking at https://sites.google.com/site/computerschess/stockfish-chess-benchmarks and if we compare:
45773 36 x64 2x Xeon E5-2696 v3 @2.80GHz Yorkman (the top one)
with for example the standard:
13797 4 x64 Intel Core i5-6600K @5.00GHz Dark_Wizzie
Wouldn't the difference actually be much smaller due to the inefficiency of parallel alpha-beta search? According to another thread, "Houdart believes that search inefficiency with doubling of threads is about 20 percent" (and I think this is in line with other discussions about hyperthreading, where they want 30%+ N/s to actually turn it on). Maybe the value is slightly different since the introduction of Lazy SMP for Stockfish, but I can't find an actual estimate of it anywhere...
The comparison for actual use would be: 0.8*0.8*0.8*45773=23435
So isn't the comparison skewed, or am I missing something?
And a third question :)
I understand that asmFish is faster since it is written in assembler; CFish was also faster earlier due to NUMA support, but now that standard Stockfish has it, it is not(?). But why are the compiles on http://chess.ultimaiq.net/stockfish.html 8.5% faster than the abrok Stockfish compiles? Of course that may happen if the standard compiles aren't using the fastest C++ standard libraries, but why wouldn't the Stockfish team use them?
I was also thinking about the importance of avoiding hash collisions for greater depths which may mean that the current hash value size for Stockfish is not optimal at longer time controls. Don't know if somebody else has investigated this lately? But that would require some major analysis work (either theoretical or practical)...
Merry Christmas! :D
how far are you away from Stockholm?
> Extensive testing of multiple positions clearly show that for 3days analysis the gain of using a lot more RAM (10x) is maybe about 20% (very rough estimations). So very close to your estimation?
Thanks for your report! So you see, this is far off what "our" Taiwanese friend claimed & tested, and indeed very close to my estimation.
> But I still think for longer analysis, such as 15days+ it would *perhaps* mean at least 100%?
NEVER on a cold frozen day in hell would it mean that. Did you take other numbers at a few hours, half a day, one and two days? I expect pretty much absolutely no real[-life] change beyond that 20% after three days (remember, every run looks a little different...). Beyond that, as I played (and won most, not losing a single...) a few corr games as well, never ever did I have the impression that analysis beyond three days, without manual interaction, makes sense at all (I have written something about analysis above).
> And due to diminishing returns 100days analysis would be 5 elo points better than 10 days ;). I couldn't find any serious work about diminishing returns above depths of 20-30. Which is strange, wouldn't somebody look at it?
Simply said: Someone running analysis for a few days is indeed interested in serious analysis, willing to manually guide the path... this can more or less lead to a few hundred more elo if you will (current ICCF-WCH dragonmist wrote about that in his ChessBase articles)... simply because most often you can outplay an unassisted engine if you want & need to!
> So isn't the comparison skewed, or am I missing something?
Aside from the fact that I would be careful with "Dark_Wizzie" results, it's more like 13797 * 1.6 [doubling to 8, minus 20%] * 1.6 [to 16] * 1.6 [to 32] = 56513 (+ the other 4 cores to 36), which is pretty reasonable given that a dual socket is way more complex and the Xeons clock way lower.
Maybe the guys/dragonmist can answer your question regarding stockfish-compiles.
> that the current hash value size for Stockfish is not optimal at longer time controls.
Indeed most engines, most notably Houdart's Houdini in earlier versions, are/were optimized for quick engine games, and there can/could be some useful tweaks for serious analysis.
Fare well, buddy!
> Did you take other numbers at a few hours, half day, one and two days? I expect pretty much absolutely no real [life] change beyond that 20% after three days (remember, every run looks a little different...). Beyond that, as i played (and won most, not losing a single...) a few corr-games as well, never ever did i have the impression that an analysis beyond three days, without manual interaction, makes sense at all (i have written above something about analysis already).
Well, I agree with you on the real-life thing, but I still think that from a theoretical viewpoint each entry in the hash table is worth a lot, with the value approaching infinity as analysis time approaches infinity. From a theoretical viewpoint, that is; from a practical viewpoint you are of course right.
I'm thinking of testing diminishing returns by comparing the difference between d=19 and d=20 vs the difference between d=29 and d=30. My calculation said that a few hundred games would take a few weeks, so not right now :)
> Aside that i would be careful with "Dark_Wizzie"-results , it's more like 13797 * 1,6 [doubling to 8, minus 20%] * 1,6 [to 16] * 1,6 [to 32] * 1/8th*1,6 [from 16 to 32] = 56513 (+ the other 4 cores to 36), which is pretty reasonable given that a dual socket is way more complex and the Xeons are clocking way lower.
To be honest, I'm not sure about your calculation. The loss for each doubling of cores is 20%, so to compensate for that (to be able to compare them), 4 cores @ X N/s would be equal in strength to 8 cores @ 1.25X N/s. Which is what I wrote (except comparing it the other way around). Multiplying by 2*0.8 (which you did) only makes sense if you do an actual doubling of cores (and not for comparing them). I hope I don't offend you - it is not my intention! :)
I've been reading up on Lazy SMP and realized that depth testing with different numbers of cores is not good either (since with more cores the search tree gets thicker than normal). So matches with fixed times (like humans play) seem the way to go to compare different aspects, but with 500 games needed to get statistically assured values it will take ages :) One conclusion from all this testing is at least that AWS EC2 is cheap, but cheap * a lot of time = not so cheap :)
Thanks for the benchmark, and good job on the high overclock (is it air or liquid? What MB and RAM are you using?)! But the IPC of Kaby Lake really seems bad? At least for chess: 13797/5.00GHz = 2759.4, while 14102/5.35GHz = 2635.9. So IPC is not better for Kaby Lake at all? It almost looks much worse, but maybe that is just due to variation in the testing method? How much higher do you think the i5-6600K result is than it should have been? If we extrapolate the result for the 3.70GHz i5-6600K we get 9625*5/3.7 = 13006. And even if that was the correct result, then 13006/5 = 2601, which is about 1.3% slower than the 6700K.
But maybe the 7600K is easier to overclock, what do you think? Are you getting the Ryzen?
For chess it seems higher than that, probably because of the BMI and SSE instruction sets? I've been looking at AVX-512, soon to be launched (next generation), but I can't figure out if that will yield an improvement...
Performance per clock is a general metric that can vary depending on what we're testing. In my limited experience just benching Haswell vs Skylake the differences in IPC are a bit on the lower side compared to some of my other applications. Also IIRC speedup of BMI2 is like 1-3%. Of course, BMI2 worked with Haswell too. It is true that Kaby Lake features no IPC improvement over Skylake. It is a refinement in process that results in higher clocks (along with misc features with associated chipset that doesn't really affect performance). Recall that Devil's Canyon had same IPC as Haswell, it just clocked better. Since Sandy improvement after each generation decreased a lot. From Sandy to Haswell was the worst I think. An average Haswell was barely faster than average Sandy due to poor overclocking, but in this case we have instruction sets to think about as well.
More specifically on the discrepancy in results over my Skylake bench: As I mentioned previously, I got lucky with my Skylake bench. In that case the engine decided a different move and that caused the engine to think longer than it normally would. But, Sedat accepted it anyways. All of my future benches will not be like this for consistency. My original mentality when submitting that was to grab the highest possible result I could no matter what. The side effect is it made Kaby Lake look worse than it really is. I suggest you take my Kaby Lake result and divide it by the correct number so you know the kn/s @ 5ghz and assume that was my normal bench on Skylake.
As for my system: I bought a binned and delidded chip from Silicon Lottery. Ram is 2x8gb 3804 15-15-15-32. The idea of cache and performance was one I was pondering over as well, since one difference between 7700k and 7600k was more cache. In the end I decided the difference is too small to justify a huge price difference with binned chips. I am running air cooled via D14. When it comes to Ryzen I have to say: These days I do not do a lot of chess. My main priorities are gaming, and most of my games are CPU limited because they use only a few cores. This is why consumer Kaby Lake becomes the fastest CPU for my purposes. This is also why Ryzen does not fit my use case. That said I expect Ryzen to be a great deal for chess fans. Even 8350 held out for a very long time with good price/performance.
Refer to my overclocking charts for reference on how much these chips clock on average:
On SSDs: my own testing showed that my 950 Pro did not perform significantly better than my 850 Pro when it comes to kn/s with heavy TB hits. I think my testing was done correctly, but it may not have been. Too bad trace-based analysis is out of reach for the common man...
So the averages on those were 5.05GHz for Kaby and 4.70GHz for Skylake. That may seem convincing, but since the sample sizes are so different (14 vs 142), maybe only the "best" overclockers have acquired Kaby Lake yet, and that is why the result is higher? Or coolers are getting better? But anyway - that is probably not the case and probably wouldn't explain much of the difference anyway :)
> Wouldn't the difference actually be much smaller due to inefficiency of parallel alfa-beta search? According to another thread "Houdart believes that search inefficiencies with doubling of threads is about 20 percent. " (and I think this is in line with other discussions about hyperthreading where they want 30%+ N/s to actually turn it on). Maybe the value is slightly different since the introduction of LazySMP for Stockfish but I can't find an actual estimate of it anywhere..
I don't think you actually tried. For Stockfish 8, the link suggests Amdahl's law with, empirically, 4.5% non-parallelizable part.
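For reference, Amdahl's law with the 4.5% non-parallelizable fraction mentioned above can be evaluated directly:

```python
def amdahl_speedup(threads, serial_fraction=0.045):
    """Amdahl's law: the serial fraction caps the achievable parallel speedup."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / threads)

print(round(amdahl_speedup(4), 2))   # -> 3.52
print(round(amdahl_speedup(32), 2))  # -> 13.36
```

Note the hard ceiling: even with infinitely many threads, the speedup can never exceed 1/0.045 ≈ 22x under this model.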
But still - comparing 4 to 32 cores, for which I used 0.8*0.8*0.8 = 0.51 as the relative performance factor - with this new Lazy SMP, going from 4 to 32 gives a relative factor of 0.47 (actual speedup / expected "NPS" speedup = (13.36/3.52)/8 = 3.795/8 = 0.47).
That comparison would then yield 0.47*45773 = 21513, which is just 56% higher than the 6600K.
And the fact that the measurement was even with 36 cores makes it even worse? So the playing performance of the 2x Xeon E5-2696 v3 should be about 50% better than the OC'd 6600K? I can't imagine what the price difference is, but maybe a factor of 5 if everything is bought new?
For deeper analysis, the positions (the leaves) that are being judged by different cores in the Lazy SMP implementation are further apart (i.e. less likely to overlap, since the calculations are independent), I would think? But maybe I just haven't understood the iterative deepening of the alpha-beta algorithm.
> So how many % would you say is representative? :)
While the real-life cache-miss rate of an engine may be [well] below the quickly-"tested" 38.959% above (no clear hash, [a single] starting position, just 19.x sec runtime...), you have to expect that cache misses are relatively high compared to more standard (more memory-localized) applications... but well, what does it matter, what can ya do? Right: get the best engine compiles, buy the best [desktop] CPU and reasonably clocked & sized RAM, overclock a little and see what you get... a big and costly server CPU is definitely not worth the price/performance just for its bigger L3 cache (but maybe for its additional cores, in case you want to afford it). The same goes for RAM: Prefer size over (overclocking) speed; given a reasonable baseline [16GB, 1600MHz], doubling both will leave you far off the performance gains promised by your stated Taiwan snake-oiler... but if you want to afford it, why not (it's like sports cars...).
> Do I think correctly that the Hash table becomes more and more important for each added ply and after a very very long time
Yes, when you carefully read the experiences of the other users here, you realize that there seems to be a pretty small "perfect" hash size for bullet games, which increases heavily when you aim for [as you said, 12h] infinite analysis... yet, from my practical experience analyzing corr games, there is some sweet spot for doubling search time or hash size beyond which everything leads to very diminishing returns, but you will see for yourself soon. Keep in mind that a modern single-processor machine calculates/looks up ~20,000,000 nodes/sec, so you can basically expect the hash to be completely flushed every few minutes = there is no practical hash size where you can store all calculations of several hours, let alone several days.
> Is there somebody who has tried storing it on SSD
I'm surprised Steinar didn't point it out above, but obviously an SSD is absolutely not suited for this high number of write cycles... its primary use in chess is for tablebases (high number of read cycles, [nearly] no write cycles).
And can you please send a personal greeting to Jenbaby?
/* Steinar */
> The thing about wearing out SSDs is true but greatly overblown.
For normal daily use, yes, but using an SSD as virtual memory for the hash table you can reach the hundreds of terabytes written where consumer SSDs start to fail within months:
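A rough write-volume estimate shows why. This assumes the worst case that every hash store goes to the SSD (real write-back through the page cache would be lower), reusing the 20 Mnps and 10 bytes/store simplifications from earlier in the thread:

```python
def tb_written_per_day(nps=20_000_000, bytes_per_store=10):
    """Hypothetical worst case: every stored node becomes an SSD write."""
    bytes_per_second = nps * bytes_per_store
    return bytes_per_second * 86_400 / 1e12  # 86,400 s/day, bytes -> TB

print(f"~{tb_written_per_day():.1f} TB/day")  # ~17.3 TB/day
```

At that rate a consumer drive's rated endurance of a few hundred TBW would be exhausted within weeks of continuous analysis.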
//===> Antares <===
People don't realize it, but flash does have “seek times” just like spinning drives. Except it's more per-request overhead/latency, and on good devices you can get around it by doing lots of simultaneous requests. On mobile flash, not so much. If you ever use the offline routing functionality on Google Maps (iOS/Android), spare a thought for all the I/O optimization we did to make it feel fast :-)
/* Steinar */
> People don't realize it, but flash does have “seek times” just like spinning drives.
Comparing sequential 4MB write speed with random 4KB write speed barely shows a factor of ~1.5, most of it obviously due to the first one's chunks being 1024 times bigger... I don't think random writes (as with a hash table) will shorten an SSD's lifetime noticeably beyond a low factor of 1.x.
//===> Antares proves the point <===
(Not to speak of a sequential 4MB write being more "concurrent" than a 4KB one can ever be?)