Rybka Chess Community Forum
The Rybka Lounge / Computer Chess / Core performance
- - By Hurnavich (****) Date 2012-01-02 11:49
Hi,

i7 950 @ 4206 MHz
no large pages
256 MB hash
start position, 1 min run

Houdini 2.0c x64
Windows Vista Premium 64-bit

1 core = 2668 kn/s
2 cores = 5367 kn/s
3 cores = 8122 kn/s
4 cores = 10486 kn/s

Deep Fritz 12 GUI

i7 980X @ 4410 MHz
no large pages
256 MB hash
start position, 1 min run

Houdini 2.0c x64
Windows 7 Premium 64-bit

1 core = 2773 kn/s
2 cores = 5565 kn/s
4 cores = 11058 kn/s
6 cores = 16149 kn/s

Deep Fritz 12 GUI

Hurnavich
Parent - - By Banned for Life (Gold) Date 2012-01-02 11:55
Too good to be true. Looks like the numbers may just be measured on one core and then multiplied by the number of cores...
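
As a quick sanity check on the posted figures, the sketch below divides each reported kn/s value by the 1-core value from the same machine. The dictionary keys and the function name `nps_scaling` are only labels chosen for this illustration; the numbers are the ones posted above.

```python
# Reported kn/s by core count, copied from the opening post.
reported = {
    "i7 950 @ 4206 MHz": {1: 2668, 2: 5367, 3: 8122, 4: 10486},
    "i7 980X @ 4410 MHz": {1: 2773, 2: 5565, 4: 11058, 6: 16149},
}

def nps_scaling(knps_by_cores):
    """Return {cores: (speedup vs 1 core, speedup per core)}."""
    base = knps_by_cores[1]
    return {
        cores: (knps / base, knps / base / cores)
        for cores, knps in sorted(knps_by_cores.items())
    }

for machine, data in reported.items():
    print(machine)
    for cores, (speedup, per_core) in nps_scaling(data).items():
        print(f"  {cores} cores: {speedup:.2f}x, {per_core:.0%} per core")
```

Every multi-core reading comes out within a few percent of perfectly linear scaling, which is the pattern that prompted the remark above.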
Parent - - By Hurnavich (****) Date 2012-01-02 12:11
Hi,

I cleared the hash at the start of each core test; what you see is the truth of the test.

Hurnavich.
Parent - - By Banned for Life (Gold) Date 2012-01-02 12:18
You are misinterpreting my remark.

I am 100% confident in your test.
I am 90% confident that Houdini is measuring kn/s on one core and reporting that number times the number of cores.
Parent - By Hurnavich (****) Date 2012-01-02 12:39
Hi,

I see where you are coming from now. Interesting.

Many thanks

Hurnavich
Parent - - By Stonehenge (***) Date 2012-01-02 17:28

> I am 90% confident that Houdini is measuring kn/s on one core and reporting that number times the number of cores.


You're wrong, the node speeds of Houdini are the real values.
Tell us, how would an engine know the speed on one core when all 6, 8 or 12 cores are running all the time?
Parent - - By Banned for Life (Gold) Date 2012-01-02 19:20
> You're wrong, the node speeds of Houdini are the real values.

OK, thanks for the correction. In that case, for reasons specified below, I will claim that the kn/s node count is not very meaningful.

> Tell us, how would an engine know the speed on one core when all 6, 8 or 12 cores are running all the time?

It's not uncommon in other multithreaded applications for speeds to be calculated on only one core which also handles the GUI interface. But maybe I am not understanding your question.

In any event, I don't see this as a valid method of gauging the strength of a hardware setup, its stated purpose. As an example, I am 100% certain that Houdini will perform a lot better on a single core at 4 GHz than on a 12-core machine with all clock frequencies scaled down to 1/3 GHz, even though both will produce almost exactly the same kn/s node count.

I have always argued for tests based on time to solution, rather than kn/s based measurements. Of course these tests must be run many times with cleared hash to reduce the variance due to MP variability, and testing on multiple positions is also important to reduce bias where one position favors one engine over another.
Parent - By Stonehenge (***) Date 2012-01-02 20:02

> In any event, I don't see this as a valid method of gauging the strength of a hardware setup, its stated purpose. As an example, I am 100% certain that Houdini will perform a lot better on a single core at 4 GHz than on a 12-core machine with all clock frequencies scaled down to 1/3 GHz, even though both will produce almost exactly the same kn/s node count.


Indeed, we're discussing this below and even giving quantitative estimates of the effect.
Parent - - By Werewolf (*****) [gb] Date 2012-01-02 12:31

> Hi,
>
> I cleared the hash at the start of each core test; what you see is the truth of the test.
>
> Hurnavich.


I suspect what you see is PART of the truth. For every doubling of cores you're getting roughly a doubling of np/s. That may be correct, it may not - only Robert H really knows.

BUT what I learnt from Bob H is that for every doubling of cores you lose 30% performance due to parallel search inefficiency.

Example: suppose you have Houdini 2 running at 1000 kn/s on a single core. Now suppose (just suppose) it runs at 1300 kn/s on two cores. You may THINK your program is running quicker, but actually it's running at the same speed. In order to speed up you must achieve MORE than a 30% increase per doubling of cores.

Your tests report nearly a 100% speedup per doubling of cores - which is fantastic - but even if that is a true reading, don't be fooled into thinking it's 4x faster running on a quad than it is on a single core.
Parent - By Hurnavich (****) Date 2012-01-02 12:41
Hi,

This is very interesting, many thanks for your insight.

Hurnavich
Parent - - By Stonehenge (***) Date 2012-01-02 17:34

> Example: suppose you have Houdini 2 running at 1000 kn/s on a single core. Now suppose (just suppose) it runs at 1300 kn/s on two cores. You may THINK your program is running quicker, but actually it's running at the same speed. In order to speed up you must achieve MORE than a 30% increase per doubling of cores.


The 30% is probably an over-estimation, I think 20% is closer to reality (at least for Houdini).
This effect explains why hyper-threading doesn't really work for chess engines - a 20% speed increase while doubling the number of threads is not useful.

So running at 30,000 kN/s with 12 threads should produce about the Elo strength of a single thread running at about 15,000 kN/s.
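
A minimal sketch of this rule of thumb, assuming the break-even factor applies uniformly to every doubling and can be interpolated smoothly for thread counts that are not powers of two (that interpolation is an assumption of the sketch, not something measured). The function name `equivalent_single_thread_knps` is made up for illustration.

```python
import math

def equivalent_single_thread_knps(knps, threads, breakeven=1.2):
    """Single-thread kn/s of roughly equal strength.

    Assumes each doubling of threads must raise kn/s by `breakeven`
    just to break even (1.2 = the 20% figure, 1.3 = the 30% figure).
    """
    doublings = math.log2(threads)
    return knps / breakeven ** doublings

print(round(equivalent_single_thread_knps(30_000, 12)))     # 20% rule
print(round(equivalent_single_thread_knps(1_300, 2, 1.3)))  # 30% example
```

With the 20% figure, 30,000 kN/s on 12 threads comes out at roughly 15,600 kN/s, in line with the "about 15,000" estimate. With the 30% figure, 1300 kn/s on two cores maps back to 1000 kn/s on one.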
Parent - - By Werewolf (*****) [gb] Date 2012-01-02 17:52

> The 30% is probably an over-estimation, I think 20% is closer to reality (at least for Houdini).
> This effect explains why hyper-threading doesn't really work for chess engines - a 20% speed increase while doubling the number of threads is not useful.
>
> So running at 30,000 kN/s with 12 threads should produce about the Elo strength of a single thread running at about 15,000 kN/s.


That's really helpful, thanks.

I'm currently running a quad i7 @ 3.6 GHz. I hope in 3 months to upgrade to a dual Xeon (16 cores total) @ 4 GHz. From what you've said it seems I could hope for a 3x speed increase (1/(1.2 x 1.2) x 4 times as many cores x a slight increase in clock speed). I'm guessing that would give me +90 Elo?
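
The arithmetic behind that 3x figure can be checked with a short sketch. It assumes raw kn/s scales with cores times clock speed and that each doubling of cores is discounted by the 1.2 break-even factor; `upgrade_speedup` is a hypothetical helper name.

```python
import math

def upgrade_speedup(old_cores, new_cores, old_ghz, new_ghz, breakeven=1.2):
    """Estimated effective speedup of a hardware upgrade.

    Raw kn/s is assumed to scale with cores x clock; each doubling of
    cores is then discounted by `breakeven` (1.2 = the 20% rule).
    """
    doublings = math.log2(new_cores / old_cores)
    raw = (new_cores / old_cores) * (new_ghz / old_ghz)
    return raw / breakeven ** doublings

print(f"{upgrade_speedup(4, 16, 3.6, 4.0):.2f}x")
```

Turning the speed ratio into Elo needs an Elo-per-doubling figure, which is not given here, so the sketch stops at the ratio.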
Parent - - By Stonehenge (***) Date 2012-01-02 18:12
16 cores at 4 GHz should produce about 40,000 kN/s, very nice :cool:.
Your Elo estimate appears reasonable, it would be interesting to run a match between the 2 setups to get the real value.
Parent - By Werewolf (*****) [gb] Date 2012-01-02 18:25

> 16 cores at 4 GHz should produce about 40,000 kN/s, very nice :cool:.
> Your Elo estimate appears reasonable, it would be interesting to run a match between the 2 setups to get the real value.


Yes I intend to do this. Apart from hardware comparison it will give a rough value of Houdini on 16 cores vs the Rybka Cluster on 40 cores (admittedly the Cluster 12 months ago)
Parent - - By Banned for Life (Gold) Date 2012-01-02 20:26
No doubt Houdini scales very well, but 90 Elo would be an unprecedented improvement for going from 4 to 16 cores (OK, a few Elo will come from the 10% speed increase). It will be an interesting experiment. I've got the under, though! :wink:
Parent - - By Stonehenge (***) Date 2012-01-02 20:40
Instead of making one-liners, could you share some of your experience in the field?
What tests did you perform, with what engines?
Parent - - By Banned for Life (Gold) Date 2012-01-02 20:47
My experience with 16 cores? None whatsoever. I won't plunk down for one of these machines, or even a 12 core unit until someone shows that they justify the added cost over a large number of quads. Since I always have many games going on, there is no obvious disadvantage to splitting over a larger number of machines, rather than using a less efficient machine with a lot more cores.
Parent - By Dr.Wael Deeb (***) [jo] Date 2012-01-03 09:24
I am with you on this....
Dr.D
Parent - By Banned for Life (Gold) Date 2012-01-02 20:57
Actually, I should qualify that. I have not used multi-socket SMP machines for chess, but my business uses them in large quantities.
Parent - By Geomusic (*****) Date 2012-01-03 03:57 Edited 2012-01-03 04:02
Yes, but what is the cost ($) per kn/s divided by relative Elo gain? Can someone make a chart for Intel vs AMD chips?
Parent - - By Werewolf (*****) [gb] Date 2012-03-30 16:45

> So running at 30,000 kN/s with 12 threads should produce about the Elo strength of a single thread running at about 15,000 kN/s.


After testing the Dual Xeon last week I've been reading over some of your old posts, including this one.

Is the right way to do the maths to say each doubling = 20% efficiency loss?

Therefore from 12 cores to 1 = 0.8 x 0.8 x 0.8 x 0.9 (because it's a 12 core not a 16 core) = 0.4608

0.4608 x 30,000 kn/s = 13,824 kn/s (the equivalent on a single core). Correct?

My results with the E5 Xeon 16 core @ 3 GHz got about 26,000 kn/s which isn't great. But the clock speeds are low so...
Parent - By Lukas Cimiotti (Bronze) [de] Date 2012-03-31 17:36
If the efficiency loss is really 20% per doubling, Houdini doesn't scale well. For Rybka it's only 15%, i.e. a factor of 1.7 per doubling of cores. I use a simplified formula: cores^(ln 1.7 / ln 2), which is cores^0.7655. The 5% for going from 15 to 16 cores comes from (16/15)^0.7655.
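
The power-law shortcut is easy to reproduce. A minimal sketch, where `effective_speedup` is a made-up name and the 1.6 column is simply what the same formula gives if a 20% loss per doubling is read as a factor of 2 x 0.8 = 1.6:

```python
import math

def effective_speedup(cores, factor_per_doubling=1.7):
    """cores ** (ln factor / ln 2): speedup relative to one core."""
    exponent = math.log(factor_per_doubling) / math.log(2)
    return cores ** exponent

print(f"exponent for 1.7: {math.log(1.7) / math.log(2):.4f}")
print(f"15 -> 16 cores:   {effective_speedup(16) / effective_speedup(15):.3f}")
for cores in (2, 4, 8, 16):
    # columns: cores, speedup at 1.7 per doubling, speedup at 1.6 per doubling
    print(cores, round(effective_speedup(cores), 2),
          round(effective_speedup(cores, 1.6), 2))
```

The exponent matches the 0.7655 quoted above, and the 15-to-16 step is about 5%.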
Parent - By Hurnavich (****) Date 2012-01-02 12:15
Hi,

Please try the test and post your results.

many thanks

hurnavich,
Parent - - By Uly (Gold) [mx] Date 2012-01-02 18:55
This is easy to test: just play Houdini on one core and see how Houdini on 2 cores scores against it. Repeat with 2 vs 4 cores. The result should be about the same; otherwise, Houdini is lying or it's counting nodes that do nothing.
Parent - - By Banned for Life (Gold) Date 2012-01-02 19:24
It's counting nodes that do nothing. A better test would be to run at full speed on one core, half speed on two cores, quarter speed on four cores, etc. All will have the same reported kn/s count, allowing the reduction in Elo for multi-core to be measured directly.
Parent - - By Uly (Gold) [mx] Date 2012-01-03 00:26
I think Vas tried to fix that with Rybka's "recalculation" of nodes, where the node count should correlate with Elo (though he overshot).

Besides that, on average his node counts said that 16 out of 17 nodes did nothing. Do you think that such a recalculation would be worth it?
Parent - By Banned for Life (Gold) Date 2012-01-03 00:30
Rybka's estimate is artificial, as are the others. As I recall, the nodes on the master are counted and the number of cores and estimated improvement based on that number are added in. Then the whole thing is scaled down to a lower number. I don't think the final scale factor is a big deal as long as you're comparing Rybka to another Rybka, same version.

But the point is that neither choice is a good basis for evaluating a platform. The uncorrected version assumes all nodes are useful, while Rybka gives you Vas' view of what you should be getting. Neither gives a true estimate.
Parent - - By Razor (****) [gb] Date 2012-01-02 14:23
As I said elsewhere, this is a pointless measurement to make. One needs to measure 'Time to Solution' {TTS} and vary the number of cores for each test. Run the test several times for each core count {e.g., at least three times each for 1xCore, 2xCore, 4xCore, and so on}; only this will tell someone who wants to know the speedup gained in solving a problem what that speedup actually is. The KN/S measure gives no indication of the TTS, even if there were some control over how the programmer has implemented it, which there is not {look at Alan's comments on this thread}.

People conclude that if, for example, 4xCore is 10x faster than 1xCore in KN/S, then the TTS will show a similar result. This in my view is unproven. In fact we very often see claims that engine A is so much better than engine B, and yet when we give any of the top-10 engines a problem to solve, most of the time they all agree on the next move to play. So some of the statements made by certain people on this forum that engine A is so much better than engine B need some 'judgement' applied.

The great thing about the TTS measure is that you are using 'Time', which cannot be changed by the programmer, and you are using the same set of problems for each core multiplier test {best to have a range of problem types so that you can see the effect of changing cores at different stages of a game}.
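
A minimal sketch of how a TTS value could be read off an engine's output, assuming the engine speaks UCI and prints `info ... time <ms> ... pv <moves>` lines. The helper name `time_to_solution_ms`, the "settles" definition and the sample lines are all made up for illustration; the lines are not output from any real run.

```python
def time_to_solution_ms(info_lines, solution_move):
    """Time (ms) at which the engine settles on `solution_move`.

    "Settles" means the first PV move equals the solution from that
    line through the last line. Returns None if it never settles.
    """
    settled_at = None
    for line in info_lines:
        tokens = line.split()
        if "time" not in tokens or "pv" not in tokens:
            continue  # skip lines without both fields (e.g. currmove)
        time_idx = tokens.index("time")
        pv_idx = tokens.index("pv")
        if time_idx + 1 >= len(tokens) or pv_idx + 1 >= len(tokens):
            continue  # field present but its value is missing
        elapsed = int(tokens[time_idx + 1])
        first_move = tokens[pv_idx + 1]
        if first_move == solution_move:
            if settled_at is None:
                settled_at = elapsed
        else:
            settled_at = None  # engine changed its mind; start over
    return settled_at

# Invented sample output: the engine finds g1f3, drops it, then finds it again.
sample = [
    "info depth 10 score cp 12 nodes 150000 time 120 pv e2e4 e7e5",
    "info depth 11 score cp 35 nodes 410000 time 300 pv g1f3 d7d5",
    "info depth 12 score cp 8 nodes 900000 time 650 pv e2e4 c7c5",
    "info depth 13 score cp 40 nodes 2100000 time 1400 pv g1f3 d7d5",
    "info depth 14 score cp 44 nodes 4800000 time 3100 pv g1f3 g8f6",
]
print(time_to_solution_ms(sample, "g1f3"))
```

Counting only the final, stable switch to the solution move is one possible convention; counting the first appearance instead would give a shorter time for the same run.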
Parent - - By ernest (****) [fr] Date 2012-04-01 23:57

> one needs to measure 'Time to Solution' {TTS} and vary the number of cores for each test.


I did experiment that on 2xCore, quite a few years ago...
see   http://rybkaforum.net/cgi-bin/rybkaforum/topic_show.pl?pid=27295#pid27295
Parent - By Razor (****) [gb] Date 2012-04-02 05:41 Edited 2012-04-02 06:08
Very interesting Ernest; whilst the work you did is not exhaustive, it does indicate that the claim of 1.7 can so easily be misused and probably continues to be to this very day.  I wonder if anyone else has extended the work you started?
Parent - - By Banned for Life (Gold) Date 2012-04-02 06:19
Thanks for bringing this back! If we can agree on a number of positions to be solved, and then come up with a tool to automate TTS measurements, I am still interested in studying the distribution of these values for a number of variables:

- Chess position
- Engine
- Cores
- Hash size

Ideally we should have at least 1000 runs for each mp test, so the solutions should not take more than a minute on SP.

It might turn out that each of the engines solves the problem with the same distribution of times (maybe log normal as you suggested), and in that case we could characterize engines by the parameters of the distribution. (This wouldn't be a complete characterization because it doesn't address how bad the alternative move is when the engine doesn't find the solution).
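
A rough sketch of what such an automation tool could look like, looping over the four variables listed above. `run_once` is a hypothetical hook that would have to drive a real engine; the stand-in used here just draws meaningless placeholder times so the sketch runs without one.

```python
import itertools
import random
import statistics

def run_grid(run_once, positions, engines, core_counts, hash_mb_sizes, runs):
    """Collect TTS samples for every combination of the four variables.

    `run_once(position, engine, cores, hash_mb)` is a hypothetical hook
    that must clear the hash, search until the solution is found, and
    return the elapsed seconds.
    """
    results = {}
    for combo in itertools.product(positions, engines, core_counts, hash_mb_sizes):
        results[combo] = [run_once(*combo) for _ in range(runs)]
    return results

def summarize(samples):
    """Mean, median and standard deviation of one list of TTS samples."""
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "stdev": statistics.stdev(samples),
    }

# Stand-in for a real engine call: ignores its inputs and returns a
# made-up log-normal time, so the printed numbers mean nothing.
rng = random.Random(0)

def fake_run_once(position, engine, cores, hash_mb):
    return rng.lognormvariate(4.0, 0.4)

grid = run_grid(fake_run_once, ["pos1"], ["engineA"], [1, 2, 4], [256], runs=1000)
for combo, samples in grid.items():
    print(combo, {k: round(v, 1) for k, v in summarize(samples).items()})
```

Keeping the raw samples per combination, rather than only the summary, leaves room to fit a distribution to them afterwards.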
Parent - - By ernest (****) [fr] Date 2012-04-03 15:59 Edited 2012-04-03 19:02

> maybe log normal


Actually, the solution is found mainly at depth n, and all the TTS values at depth n show up as a bell curve or main hump. But the solution is also found (less often) at depths n+1, n-1, n+2, n-2..., so the distribution curve gets small "humplets" at the sides of the main hump.

Also, I must report a mistake (which I will correct below):
since I used 256 MB hash for all the dualcore tests, I should have used 128 MB hash for the single core test, in order to compare correctly.
But the reproducible single core test (69 sec, the vertical red line) was made with the same 256 MB hash

I wrote:
using 1 processor, I found 69 sec (of course reproducible)
using 2 processors, I did 200 (automatic) tests : I got timings ranging from 27 to 184 sec

The mean of the 2-proc timings is 69 sec (note that for this position, it is not better than the 1-proc timing: where is the x1.7 improvement of bi-processors?  :-)).
The 50% median is 61 sec (99 values are less than 61, 101 values are larger than or equal to 61).


With 1 processor (core) and 128 MB hash, the (reproducible) solution time is 93 sec. So the gain 2core/1core can be evaluated, on that particular position (and with Rybka 232a) as 93/61 = 1.52
Not 1.7, but not too bad...   :smile:
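
The gain depends on which summary of the 2-core timings is used. A small sketch, with `tts_gain` as a made-up helper; only the summary figures quoted above are used, since the 200 raw timings are not posted.

```python
import statistics

def tts_gain(single_core_seconds, multi_core_seconds, use="median"):
    """Ratio of the single-core TTS to a summary of the multi-core TTS runs."""
    if use == "median":
        typical = statistics.median(multi_core_seconds)
    else:
        typical = statistics.mean(multi_core_seconds)
    return single_core_seconds / typical

# Summary figures reported above.
print(f"128 MB single core vs 2-core median: {93 / 61:.2f}")
print(f"128 MB single core vs 2-core mean:   {93 / 69:.2f}")
print(f"256 MB single core vs 2-core median: {69 / 61:.2f}")

# Toy list built only from the reported minimum, median and maximum.
print(f"{tts_gain(93, [27, 61, 184]):.2f}")
```

Against the mean of 69 sec the same 93 sec baseline gives about 1.35 rather than 1.52, because the long tail of slow runs pulls the mean above the median.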
Parent - By Banned for Life (Gold) Date 2012-04-03 22:28
The point I was trying to make with Vas, apparently not very successfully, was that some distributions are characterized well by their mean, while others are not. I suspect that the nature of this problem, where finding the solution at any time before the cutoff point (where the engine makes a move) is equal, and where finding the solution at any time after the cutoff point is also equal (the best move was overlooked in the allocated time), is not in the 'well characterized by the mean' category. It seems like it would be much more important to know the variance of the distribution, with better engines having a larger percentage of solutions at a move time of interest (which is likely to be maximized by minimizing the variance, even if this increases the mean to some extent).
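
To illustrate with deliberately made-up timings: two sets of TTS values with the same mean can give very different success rates at a given move time. `solved_by_cutoff` is a hypothetical helper, and the lists are invented for this example.

```python
import statistics

def solved_by_cutoff(tts_seconds, cutoff_seconds):
    """Fraction of runs that found the solution within the move time."""
    return sum(t <= cutoff_seconds for t in tts_seconds) / len(tts_seconds)

# Two made-up sets of timings with the same mean but different spread.
tight = [50, 55, 60, 65, 70]
wide = [20, 30, 40, 90, 120]
assert statistics.mean(tight) == statistics.mean(wide) == 60

for cutoff in (45, 75):
    print(cutoff, solved_by_cutoff(tight, cutoff), solved_by_cutoff(wide, cutoff))
```

At a 75-second move time the tighter set solves every run while the wider one solves 60%; at 45 seconds the order flips. The mean alone says nothing about either case, and which spread is better depends on where the cutoff sits relative to the distribution.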
Parent - By Banned for Life (Gold) Date 2012-04-02 06:21
Furthermore, I think that Vas was wrong on this topic, but more experimentation work needed to be done to show this...