i7 950 @ 4206 MHz
no large pages
256 MB hash
start position, 1 min run
Houdini 2.0c x64
Windows Vista Premium 64-bit
1 core = 2668 kn/s
2 cores = 5367 kn/s
3 cores = 8122 kn/s
4 cores = 10486 kn/s
Deep Fritz 12 GUI
i7 980x @ 4410 MHz
no large pages
256 MB hash
start position, 1 min run
Houdini 2.0c x64
Windows 7 Premium 64-bit
1 core = 2773 kn/s
2 cores = 5565 kn/s
4 cores = 11058 kn/s
6 cores = 16149 kn/s
Deep Fritz 12 GUI
Hurnavich
I cleared the hash at the start of each core test; what you see is the truth of the test.
Hurnavich.
I am 100% confident in your test.
I am 90% confident that Houdini is measuring kn/s on one core and reporting that number times the number of cores.
I see where you are coming from now. Interesting.
Many thanks
Hurnavich
> I am 90% confident that Houdini is measuring kn/s on one core and reporting that number times the number of cores.
You're wrong, the node speeds of Houdini are the real values.
Tell us, how would an engine know the speed on one core when all 6, 8 or 12 cores are running all the time?
OK, thanks for the correction. In that case, for reasons specified below, I will claim that the kn/s node count is not very meaningful.
> Tell us, how would an engine know the speed on one core when all 6, 8 or 12 cores are running all the time?
It's not uncommon in other multithreaded applications for speeds to be calculated on only one core, which also handles the GUI. But maybe I am not understanding your question.
In any event, I don't see this as a valid method of gauging the strength of a hardware setup, its stated purpose. As an example, I am 100% certain that Houdini will perform a lot better on a single core at 4 GHz than on a 12-core machine with all clock frequencies scaled down to 1/3 GHz, even though both will produce almost exactly the same kn/s node count.
I have always argued for tests based on time to solution, rather than kn/s based measurements. Of course these tests must be run many times with cleared hash to reduce the variance due to MP variability, and testing on multiple positions is also important to reduce bias where one position favors one engine over another.
> In any event, I don't see this as a valid method of gauging the strength of a hardware setup, its stated purpose. As an example, I am 100% certain that Houdini will perform a lot better on a single core at 4 GHz than on a 12-core machine with all clock frequencies scaled down to 1/3 GHz, even though both will produce almost exactly the same kn/s node count.
Indeed, we're discussing this below and even giving quantitative estimates of the effect.
> Hi,
>
> I cleared the hash at the start of each core test; what you see is the truth of the test.
>
> Hurnavich.
I suspect what you see is PART of the truth. For every doubling of cores you're getting roughly a doubling of np/s. That may be correct, it may not - only Robert H really knows.
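To put a number on "roughly a doubling", the kn/s figures from the two runs posted above can be turned into speedup ratios. This is only a sketch: the dictionary layout and the function name are made up for illustration, and it says nothing about how Houdini counts nodes internally.

```python
# kn/s figures copied from the two benchmark posts (cores -> kn/s).
RESULTS = {
    "i7 950 @ 4206 MHz": {1: 2668, 2: 5367, 3: 8122, 4: 10486},
    "i7 980x @ 4410 MHz": {1: 2773, 2: 5565, 4: 11058, 6: 16149},
}


def scaling_table(knps_by_cores):
    """Return (cores, speedup vs 1 core, fraction of linear scaling) rows."""
    base = knps_by_cores[1]
    rows = []
    for cores in sorted(knps_by_cores):
        speedup = knps_by_cores[cores] / base
        rows.append((cores, speedup, speedup / cores))
    return rows


for name, data in RESULTS.items():
    print(name)
    for cores, speedup, per_core in scaling_table(data):
        print(f"  {cores} cores: {speedup:.2f}x node rate, {per_core:.0%} of linear")
```

Both machines stay within a few percent of linear node-rate scaling, which is a statement about nodes per second only, not about playing strength.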
BUT what I learnt from Bob H is that for every doubling of cores you lose 30% performance due to parallel search inefficiency.
Example: suppose you have Houdini 2 running at 1000 kn/s on a single core. Now suppose (just suppose) it runs at 1300 kn/s on two cores. You may THINK your program is running quicker, but actually it's running at the same speed. In order to speed up you must achieve MORE than a 30% increase per doubling of cores.
Your tests report nearly a 100% speedup per doubling of cores, which is fantastic, but even if that is a true reading, don't be fooled into thinking it's 4x faster running on a quad than it is on a single core.
This is very interesting, many thanks for your insight.
Hurnavich
> Example: suppose you have Houdini 2 running at 1000 kn/s on a single core. Now suppose (just suppose) it runs at 1300 kn/s on two cores. You may THINK your program is running quicker, but actually it's running at the same speed. In order to speed up you must achieve MORE than a 30% increase per doubling of cores.
The 30% is probably an over-estimation, I think 20% is closer to reality (at least for Houdini).
This effect explains why hyper-threading doesn't really work for chess engines - a 20% speed increase while doubling the number of threads is not useful.
So running at 30,000 kN/s with 12 threads should produce about the Elo strength of a single thread running at about 15,000 kN/s.
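That rule of thumb can be written as a small calculation. The sketch below assumes a constant break-even factor of 1.2 per doubling of threads (one reading of the 20% figure) and uses log2 to handle thread counts that are not powers of two. The function name is hypothetical and the model is an assumption, not a description of Houdini's search.

```python
import math


def single_thread_equivalent(knps, threads, loss_per_doubling=1.2):
    """Estimate the single-thread kN/s with comparable strength.

    Assumes each doubling of threads must deliver `loss_per_doubling`
    times more nodes just to break even (an assumed model).
    """
    doublings = math.log2(threads)
    return knps / (loss_per_doubling ** doublings)


# 12 threads at 30,000 kN/s: about 15,600 under these assumptions.
print(round(single_thread_equivalent(30_000, 12)))

# Hyper-threading case: 2 threads giving only 20% more nodes is break-even.
print(round(single_thread_equivalent(1_200, 2)))
```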
> The 30% is probably an over-estimation, I think 20% is closer to reality (at least for Houdini).
> This effect explains why hyper-threading doesn't really work for chess engines - a 20% speed increase while doubling the number of threads is not useful.
>
> So running at 30,000 kN/s with 12 threads should produce about the Elo strength of a single thread running at about 15,000 kN/s.
That's really helpful, thanks.
I'm currently running a quad i7 @ 3.6 GHz. I hope in 3 months to upgrade to a dual Xeon (16 cores total) @ 4 GHz. From what you've said it seems I could hope for a 3x speed increase: 1/(1.2 x 1.2) x 4 times as many cores x a slight increase in clock speed. I'm guessing that would give me +90 Elo?
Your Elo estimate appears reasonable, it would be interesting to run a match between the 2 setups to get the real value.
> 16 cores at 4 GHz should produce about 40,000 kN/s, very nice
> Your Elo estimate appears reasonable, it would be interesting to run a match between the 2 setups to get the real value.
Yes I intend to do this. Apart from hardware comparison it will give a rough value of Houdini on 16 cores vs the Rybka Cluster on 40 cores (admittedly the Cluster 12 months ago)
What tests did you perform, with what engines?
Dr.D
> So running at 30,000 kN/s with 12 threads should produce about the Elo strength of a single thread running at about 15,000 kN/s.
After testing the Dual Xeon last week I've been reading over some of your old posts, including this one.
Is the right way to do the maths to say each doubling = 20% efficiency loss?
Therefore from 12 cores to 1 = 0.8 x 0.8 x 0.8 x 0.9 (because it's a 12-core, not a 16-core) = 0.4608
0.4608 x 30,000 kn/s = 13,824 kn/s (the equivalent on a single core). Correct?
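The two ways of reading "20% per doubling" give slightly different answers, which a short sketch makes explicit. It assumes either a multiplicative 0.8 efficiency per doubling (as in the hand calculation) or a required 1.2x node-rate gain per doubling, and it replaces the hand-picked 0.9 factor with log2 for the fractional doubling. The variable names are illustrative only.

```python
import math

RAW_KNPS = 30_000
CORES = 12
doublings = math.log2(CORES)  # about 3.58 doublings from 1 to 12 cores

# Reading 1: keep 80% of the node rate per doubling.
keep_80 = RAW_KNPS * 0.8 ** doublings
# Reading 2: need 1.2x the node rate per doubling to break even.
need_120 = RAW_KNPS / 1.2 ** doublings
# The hand calculation: three doublings at 0.8, then 0.9 for the remainder.
by_hand = RAW_KNPS * 0.8 * 0.8 * 0.8 * 0.9

print(f"0.8 per doubling:   {keep_80:8.0f} kn/s")
print(f"1/1.2 per doubling: {need_120:8.0f} kn/s")
print(f"hand calculation:   {by_hand:8.0f} kn/s")
```

Under these assumptions the hand calculation lands between the two readings, and the 1/1.2 reading is the one that comes out close to the "about 15,000 kN/s" figure quoted above.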
My results with the E5 Xeon 16 core @ 3 GHz got about 26,000 kn/s which isn't great. But the clock speeds are low so...
Please try the test and post your results.
many thanks
hurnavich,
Besides, given that on average his node counts implied that 16 out of 17 nodes did nothing, do you think such a recalculation would be worth it?
But the point is that neither choice is a good basis for evaluating a platform. The uncorrected version assumes all nodes are useful, while Rybka gives you Vas' view of what you should be getting. Neither gives a true estimate.
> one needs to measure 'Time to Solution' {TTS} and vary the number of cores for each test.
I did experiment that on 2xCore, quite a few years ago...
see http://rybkaforum.net/cgi-bin/rybkaforum/topic_show.pl?pid=27295#pid27295
- Chess position
- Engine
- Cores
- Hash size
Ideally we should have at least 1000 runs for each mp test, so the solutions should not take more than a minute on SP.
It might turn out that each of the engines solves the problem with the same distribution of times (maybe log normal as you suggested), and in that case we could characterize engines by the parameters of the distribution. (This wouldn't be a complete characterization because it doesn't address how bad the alternative move is when the engine doesn't find the solution).
> maybe log normal
Actually, the solution is mostly found at depth n, so the TTS values at depth n form a bell curve, the main hump. But the solution is also found (less often) at depths n+1, n-1, n+2, n-2, ..., so the distribution curve has small "humplets" on either side of the main hump.
Also, I must report a mistake (which I will correct below):
Since I used 256 MB hash for all the dual-core tests, I should have used 128 MB hash for the single-core test, in order to compare correctly.
But the reproducible single-core test (69 sec, the vertical red line) was made with the same 256 MB hash.
I wrote:
using 1 processor, I found 69 sec (of course reproducible)
using 2 processors, I did 200 (automatic) tests: I got timings ranging from 27 to 184 sec
The mean of the 2-proc timings is 69 sec (note that for this position, it is not better than the 1-proc timing: where is the x1.7 improvement of bi-processors? :-)).
The 50% median is 61 sec (99 values are less than 61, 101 values are larger than or equal to 61).
With 1 processor (core) and 128 MB hash, the (reproducible) solution time is 93 sec. So the gain 2core/1core can be evaluated, on that particular position (and with Rybka 232a) as 93/61 = 1.52
Not 1.7, but not too bad...
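The arithmetic behind that comparison is short enough to sketch. Only the summary figures (93 sec on one core, 61 sec median on two cores) come from the test above. The list of timings in the code is made-up illustrative data, since the 200 individual results are not reproduced here, and the helper name is hypothetical.

```python
from statistics import mean, median


def tts_speedup(single_core_seconds, multi_core_runs):
    """Compare a reproducible 1-core time with repeated multi-core runs.

    Uses the median of the multi-core runs, because the run-to-run
    spread of a parallel search is wide and skewed.
    """
    return single_core_seconds / median(multi_core_runs)


# Summary figures reported above: 93 s on 1 core, median 61 s on 2 cores.
print(f"{93 / 61:.2f}")  # 1.52

# Hypothetical sample of 2-core timings in seconds (NOT the real data).
sample_runs = [27, 45, 52, 58, 61, 64, 70, 88, 184]
print(f"mean {mean(sample_runs):.1f} s, median {median(sample_runs)} s")
print(f"speedup {tts_speedup(93, sample_runs):.2f}")
```

A single slow outlier such as the 184 sec run pulls the mean well above the median, which is one reason to report both when comparing core counts.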