Rybka Chess Community Forum
Up Topic Rybka Support & Discussion / Aquarium / several single-core engines vs. one multi-core engine
- - By sbm (*) Date 2017-10-21 14:35
Hi there,

I hope that I have read all the relevant topics to find the answer to the issue I'm looking into. It is possible that I have missed something related, but I did not find a proper answer. What I would like to know relates to the "several single-core engines vs. one multi-core engine" approach in IDeA that was presented in the Parallel Search and IDeA video by Carl Bicknell.
How can one implement the cooperative use of several instances of the same engine simultaneously, on the very same position? I hope what I'm looking for is understandable, but to put it the other way around: I want to achieve behavior similar to starting a simple IA with one engine using all cores of the CPU, but using, say, 1 engine on every core of the CPU (let's say 4 cores in total, therefore 4 instances of the same engine). Thanks
Parent - - By Ghengis-Kann (***) [us] Date 2017-10-23 19:47
The way you implement this is by going into the engines tab of the IDEA window and either typing a number or clicking the little arrow to change the number of instances of an engine that will be used by IDEA.  The number of threads per engine is assigned in the Engines window.

Plusses and minuses of hyperthreading are a perpetual argument, but in any case it does give you more addressable cores, so for the sake of providing an example I will say we have a quad-core processor with hyperthreading enabled, which gives up to 8 addressable cores.

You should leave at least one of these unassigned so Aquarium itself has CPU power available, but I wouldn't give it more than 2.
Let's say we leave 2 cores available for Aquarium and other programs, leaving us with 6 to assign for analysis.

For straight up infinite analysis it is best to go into the engines tab, set the engine to run on 6 threads, and you are good to go.
For automatic expansion using IDEA I would set the number of threads to 1 in the engine settings and run 6 instances of it in IDEA.

My current analysis method relies on using infinite analysis to send positions to IDEA, and for this you need 2 different engines.
Suppose you bought the Houdini Aquarium and got Stockfish for free.
One reasonable possibility would be to assign 2 threads to Houdini and 1 thread to Stockfish in the engine settings, then assign Houdini as the IA engine and tell IDEA to use 4 instances of Stockfish.

Hope that helped,
Ghengis-Kann
Parent - - By sbm (*) Date 2017-11-03 09:58 Edited 2017-11-03 10:18
Thanks. Well, I think it is better for me to go through this process with the help of some screenshots, because even though I understand what you are saying, I still got stuck at the end.
So, the image below shows "the engines tab of the IDEA window" and the settings "to change the number of instances of an engine".
Parent - - By sbm (*) Date 2017-11-03 10:10 Edited 2017-11-03 10:18
The next image refers to the settings regarding "...the engines tab, set the engine to run on 6 threads, and you are good to go." Well, in my case there are no extra logical cores, therefore I have only 4 cores/threads in total, so if I leave 1 for AQ, I have 3 cores/threads for the engine.
Parent - - By sbm (*) Date 2017-11-03 10:12 Edited 2017-11-03 10:21
The next image shows the setup for the "For straight up infinite analysis it is best to go into the engines tab..." settings. 1. Sandbox 2. Custom 3. Engine 4. Personalities 5. Cores/threads 6. Saving
This is in the case of Sandbox.
Parent - - By sbm (*) Date 2017-11-03 10:14 Edited 2017-11-03 10:27
If I understand it properly, this is the sequence of the setup.

So, here is the situation where one would use several instances of an engine to analyse, but what is the workflow to achieve that?! How do you start it?

Let's say it is this position after Black's second move in the Najdorf. This is a new IDeA project.
Here I would like to use three instances of Komodo. For example, I would like to use these 3 instances of the engine on this position with Auto-play, and set it for depth 31.

In my understanding, when one starts IDeA with this setup the result should be: ALL 3 instances of the engine should work on this very position until the analysis reaches the predefined depth of 31, then make the move (in this case White's third move) and continue to work the same way on Black's next move. But it does not do this... it starts only one engine. It is obvious that I missed something.
Parent - - By dickie (**) [gb] Date 2017-11-04 07:49
You are confusing instances with threads. One task will only ever use one instance of an engine. You can, as you have done here with Komodo, run the engine instance with more than one thread, but it is a less efficient use of IDeA. The point that Carl B was trying to make is that running a task for say 10 minutes with a 1-threaded instance will give a better result than running it for 5 minutes with a 2-threaded instance. The 1-threaded approach makes more efficient use of the CPUs available to you.
Parent - - By sbm (*) Date 2017-11-09 17:53
Thanks dickie for replying. Well, I definitely confuse something :confused:, but it is not instances with threads... unfortunately. The screenshots are somewhat misleading, as they were taken before I set the threads to 1. So every engine had only 1 thread.
Not much to confuse here: if we are talking about a 4-core processor, then we can run 4 threads from one process (engine) at the same time. Or we can run 4 threads from 4 processes (4 instances of engines) at the same time. But in the meantime I think I realized that what I tried to accomplish is not quite possible. Or does IDeA do things differently and somehow "glue" together the results of the calculations of these separate engines?! Because even though these engines are running simultaneously on the same position, their calculation isn't shared between them; they do not have a shared hash table, and so on...
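To make the distinction concrete, here is a minimal sketch (not Aquarium's own mechanism, just an illustration using the python-chess library) of the two configurations being discussed: one engine process given 4 threads versus four single-threaded engine processes started side by side on the same position. The engine path, hash size, time budget and the example Najdorf position are my own assumptions; note that the four separate processes each keep their own private hash table, which is exactly the limitation described above.

import concurrent.futures
import chess
import chess.engine

ENGINE_PATH = "stockfish.exe"   # assumed path to a UCI engine binary
# An example Najdorf position (after 5...a6), just for illustration.
FEN = "rnbqkb1r/1p2pppp/p2p1n2/8/3NP3/2N5/PPP2PPP/R1BQKB1R w KQkq - 0 6"

def analyse(threads, seconds=60):
    # One engine process with the given Threads setting; returns its final info.
    engine = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)
    engine.configure({"Threads": threads, "Hash": 512})
    info = engine.analyse(chess.Board(FEN), chess.engine.Limit(time=seconds))
    engine.quit()
    return info

# Case 1: one process using 4 threads (one shared search, one shared hash).
print("Threads=4: depth", analyse(4).get("depth"))

# Case 2: four independent single-threaded processes on the same position,
# started in parallel, each with its own private hash table.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for info in pool.map(analyse, [1, 1, 1, 1]):
        print("Threads=1: depth", info.get("depth"))

As you would expect, case 2 mostly duplicates work when pointed at a single position; IDeA gets its benefit from handing each single-threaded instance a different position (task) instead.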
Parent - - By pawnslinger (***) [us] Date 2017-11-10 17:10 Edited 2017-11-11 09:17
The number of cores the CPU has is not the important factor.  With modern cores, like Intel's Kaby Lake or Coffee Lake, or AMD's Ryzen, the hyper-threading is impressive, especially when compared to the early versions from 8 or 9 years ago.  So the important thing is the thread count that the CPU supports.  My current CPU has 8 cores and 16 threads... so I routinely run 10-14 instances of my engine of choice (Stockfish), each set to 1 thread.  I know that IDeA has a parameter to control the "count" of each engine, but that has not worked well for me, so I have a separate engine setup for each instance I run, right down to separate folders (even though they are technically the same engine)... and I name each exe a different name, i.e. stockfish_1.exe, stockfish_2.exe, etc.  I do this because when I upgraded to Windows 10, Windows seemed to load only 1 copy of the program (so the instances would have shared resources), and I didn't like that; I found it slowed everything tremendously.  So I have forced Windows to load a completely separate copy of each instance of an engine... I even have multiple copies, on different HDDs, of the tablebase I use, to spread the I/O over several devices... clogging up 1 device with too much I/O can also slow things down... best to spread it around.
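For what it's worth, the "separate folder per instance" setup described above can be scripted. A minimal sketch follows; the paths, folder names and the number of copies are assumptions, adjust to taste:

import os
import shutil

SOURCE_EXE = r"C:\Engines\stockfish.exe"     # the original engine binary (assumed path)
BASE_DIR = r"C:\Engines\instances"           # where the per-instance folders go
COPIES = 4

for i in range(1, COPIES + 1):
    folder = os.path.join(BASE_DIR, "stockfish_%d" % i)
    os.makedirs(folder, exist_ok=True)
    # Each instance gets its own folder and its own exe name (stockfish_1.exe,
    # stockfish_2.exe, ...), so each can be registered in Aquarium as a distinct engine.
    shutil.copy(SOURCE_EXE, os.path.join(folder, "stockfish_%d.exe" % i))
    print("created", os.path.join(folder, "stockfish_%d.exe" % i))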
Parent - - By sbm (*) Date 2017-11-19 09:25
Well, in my case it is important... as I have only 4 physical cores :eek:. No hyper-threading/SMT involved here yet. I know it is way too obsolete and slow, but that's what I use. Therefore it is imperative to find and use the best setup.  Anyway, regarding the thread count that a CPU supports: as far as I know, Intel and AMD both use 2-way SMT (no matter how fancy the names they give their implementations), so they double the number of threads per core... not more. Hm... the thread count control parameter within IDeA... I never thought that this could be an issue... I just checked in Process Explorer and it shows that I have 3 (or 4, if I set it so) instances loaded by Windows (Windows 10 Pro x64), but it is good to know that you found how to avoid this kind of issue... btw, from your experience, roughly what performance gain (in %) did you manage to preserve by optimizing the I/O and using completely separate exes as processes?
Parent - - By pawnslinger (***) [us] Date 2017-11-19 17:52 Edited 2017-11-20 05:27
What percentage improvement?  Hard to say, I never did any precise benchmarking.  My setup just evolved over time.  As I noticed bottlenecks, I tried to figure out ways around them.  For the longest time I ran with an old Intel CPU under Windows Vista... so I gained a lot of experience trying to streamline and optimize.  What I always tried to avoid was overloading any 1 thing, be it a CPU core or a disk drive.  I noticed that when things got overloaded, the system stopped responding well, Windows would refuse to redraw the screen, etc.  So through this sort of stumbling along, I found ways that worked for me.  On the old Intel CPU I tried to get things tuned so that approximately 1,000 events per hour were pushed through the queue, and generally that resulted in 6 compute threads at a depth of 19 or so.  With my new hardware, with better SMT and better IPC (instructions per clock), I still try for 1,000 events per hour, but thanks to the efficiency of the new hardware I am pushing 12 threads on average at a depth of 25-26 on average.  So you can see the new hardware produces much better analysis.  I have quite an array of disk drives that has grown over time.  I started with just 1... a long time ago... but now I have 2 SSDs, 4 internal multi-terabyte mechanical drives (a couple of WD Blacks and a couple of Greens), and 1 external WD My Book drive for backup.  The monster has grown over quite some time.  This particular system started with a single-core Pentium, then the quad-core i7-920, and now AMD's Ryzen.  What a trip.  Upgrading as I went.  Still using the same computer case that I started with, an old Antec 900 case - I don't even know if they make those anymore.  Always upgrading and playing with it... my hobby, maybe more so than Chess.
Parent - - By Ghengis-Kann (***) [us] Date 2017-11-20 22:02
Hi Pawnslinger.

The hardware threads you are referring to are exactly the same as what I call logical cores.

Intel CPUs starting around i7 (Ivy Bridge) have 2 hardware threads per core.
These hardware threads are computational engines with a small amount of locally available memory.
AMD does the same thing and calls these pairs bulldozer groups.
Each pair of hardware threads shares a single memory bus with which to communicate with programs or higher levels of memory (motherboard level cache, RAM, or drives).

The degree to which hyperthreading improves performance depends on how much of a traffic jam there is on the shared memory buses.
My own experiments on an i7 processor using a benchmark utility called Sandra Lite show a 33% improvement from enabling hyperthreading in the worst case scenario (continuous 100% CPU usage on simple arithmetic calculations).

Modern operating systems do a good job of balancing the load among the available resources using a NUMA architecture, which stands for Non-Uniform Memory Access.
"NUMA awareness" shows up as an option for some of the engines (e.g. Houdini), but only really comes into play when you have more than one CPU chip on a single motherboard. Otherwise just set it to enabled at address zero.

I also create separate instances of engines with their own folders, but only on my remote computers.
Each instance is given a unique name and port number so they are recognized by Aquarium as distinct engines.

I have also experienced Aquarium getting overloaded if you ask it to process too many tasks in a given time frame.
It appears that the program itself only runs on one thread.
Maybe multi-threaded execution should be added to the inevitable "Wishes for Aquarium 2019" topic...
Parent - - By pawnslinger (***) [us] Date 2017-11-21 04:18 Edited 2017-11-21 04:33
I am not exactly sure when SMT started, but I believe it was prior to Ivy Bridge.  I first had a cpu with SMT on an old Pentium, but it was so clunky that I turned it off completely...

With Ryzen, AMD has left the Bulldozer architecture behind (and thrown it into the dust bin).  If you have not checked out Ryzen, I suggest that you lookup a few videos on YouTube... it just came out this last Fall, and it implements full SMT... not that Bulldozer garbage.  I have a laptop with Bulldozer and it is really pathetic.  Ryzen on the other hand is quite good.  And the SMT implementation is the best I have seen... there may be better, but you couldn't prove it by me.  When I use a hyper-threaded core, as far as I can tell, there is no degradation in Stockfish performance.  I routinely get 1,000 kNps per thread (single threaded) per instance of Stockfish (no matter how many instances are actively computing).  When I compare that with my old i7, it is like a night and day difference.  Of course, I haven't used a more recent Intel cpu, so I don't know how good the SMT is on those parts.

And Ryzen supports newer memory architectures too...  it has really thrown Intel into a tizzy... Intel doesn't know what competition feels like anymore.  I am currently running 16gb of DDR4 dram running at 2999mhz.  Personally, again, that is the fastest dram that I have ever used.  I know there is faster, but I do have a limited budget (and this stuff cost me around $260 for that 16gb kit).

Oddly, the only problem I have run into with Ryzen's SMT is when using Chessbase (Aquarium and Stockfish work great).  When loading up Stockfish threads in Chessbase, if I go over 7 threads, I get a lot of cpu bottlenecking.  Mouse and keyboard stop responding and my internet becomes laggy.  So I have pretty much stopped using more than 6 threads of Chessbase.  With Aquarium, there is no problem, my cpu supports 16 threads, but I have accidentally loaded 20 threads, and no bottlenecking of any sort that I could detect... in fact it ran like that for awhile, before I noticed that Task Manager had pegged to 100%,  I checked cpu temp and it was okay, around 60C, so there was no panic, but I did cut back to 12 threads.  At 12 threads, my cpu temp is normally in the low 50s.
Parent - - By dickie (**) [gb] Date 2017-11-21 11:10
I believe both Houdini and Komodo recommend not using hyper-threading, and I imagine the same will apply to other engines. The Houdini reasoning is at http://www.cruxis.com/chess/manual/index.html under Cores and Threads Management: the loss of parallel alpha-beta search efficiency offsets the increased node speed benefit, with little or no improvement to the engine's Elo.
Parent - - By pawnslinger (***) [us] Date 2017-11-21 14:37
I believe that is an old recommendation.  I believe that technology has improved since that time.  Plus I have always believed that "the loss of parallel alpha-beta search efficiency" is sheer double talk.  Either more threads are more productive or not.  It all depends on the hardware, the implementation efficiency of SMT, and how well the software can scale up to take advantage.

Now it is entirely possible that more threads cannot help infinite analysis... because of the poor ability to scale up the search algorithm in any given engine... i.e. the poor ability to break the search down into tasks that can be spread over more cores.  I do believe that early engines suffered from this, but the longer that multi-core CPUs have been available, the better chess engines have become at taking advantage of them.  However, this does not apply to Aquarium, since it has always been perfectly able to scale the search and take advantage of many, many cores or threads... 1 per engine per instance of that engine.  Aquarium squeezes the maximum efficiency out of any thread that is available to it.  Hence the fellow in this forum who is selling access to the 100-core supercomputer!

With the exception of the early implementations of SMT, I have always used SMT in my Chess analysis, especially with Aquarium.
Parent - - By dickie (**) [gb] Date 2017-11-21 16:17
It is the current recommendation of both of the latest versions, Houdini 6 and Komodo 11.2, not to use hyper-threading. Both recommendations appear to be based on current testing. Aquarium facilitates engine instances running side by side but does not have any influence over engine performance. One of the advantages I find in having HT turned off is that Windows is able to manage small tasks like web browsing and email, and running Aquarium, while at the same time all the cores are allocated to engine analysis. There is no need to hold a thread or two in reserve. And if you are able to overclock, HT is definitely disadvantageous.
Parent - By pawnslinger (***) [us] Date 2017-11-21 16:53
To each his own.  My experience differs.  As I said before, SMT (or any increase in thread count) may not be beneficial to infinite analysis, depending on the ability of the engine to scale up.  But I simply don't believe this has anything to do with SMT (or HT as some refer to it).  HT is the Intel name for the feature.  As it is, on my system, I have 8 physical cores with SMT enabled, so that the cpu can manage 16 SMT threads.  This allows me to run Aquarium (with 6 single thread instances of Stockfish... I don't use the engines you mentioned), and Chessbase with a single instance of Stockfish with 6 threads working.  Plus this web browser, Firefox, and Windows Task Manager says that I have 1976 active threads on my system.  All quite productive.  Without SMT the story would be very different... better or worse, is probably a subjective judgement.
Parent - - By Ghengis-Kann (***) [us] Date 2017-11-21 18:46 Edited 2017-11-21 18:58
Not sure about Houdini 6, but Houdini 5's documentation reported hyperthreading performance that was statistically insignificant compared to turning it off.
If anything it was slightly stronger with hyperthreading enabled.
Their calculation of 25% to 30% additional node speed is very similar to the 33% I measured with benchmarking tools.

FAQ reference pasted below:

Q: I'm running Houdini on a quad-core Core i7 CPU with hyper-threading. Would you recommend to use hyper-threading with Houdini and run the engine with 8 threads?

The additional hyper-threads will yield about 25% to 30% extra node speed, but the inefficiency of the parallel alpha-beta search with the higher number of threads will partially offset this speed gain. Running with 8 instead of 4 threads will therefore produce only a small increase in Elo – probably at most 10 Elo.
If your CPU can be overclocked it may be wiser not to use hyper-threading. By not using the hyper-threading you will reduce the thermal load of the CPU which will allow you to reach a higher overclocking frequency.

To illustrate this, in a 12 vs 24 threads test match on a 12-core computer (Intel Xeon processor) the outcome after 1500 games was (+7 ±10 Elo) in favor of the 24-thread engine. In other words even with 1500 games played the measured Elo difference was still inside the error margin.
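As a sanity check on those numbers, here is a small back-of-the-envelope sketch in plain Python (the 60% draw rate is my own assumption, since the FAQ does not state one) showing that a +7 result over 1500 games does indeed sit inside roughly a +/-10 Elo error margin:

import math

def elo_from_score(score):
    # Convert a match score fraction (0..1) into an Elo difference.
    return -400 * math.log10(1 / score - 1)

games = 1500
score = 0.51        # a 51% score is roughly a +7 Elo result
draw_rate = 0.60    # assumed draw rate; not given in the FAQ

# Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
wins = score - draw_rate / 2
variance = wins + draw_rate * 0.25 - score ** 2
stderr_score = math.sqrt(variance / games)

# Convert the 95% confidence interval from score units to Elo units
# using the local slope of the Elo curve around 50%.
slope = 400 / (math.log(10) * score * (1 - score))
print("Elo difference: %+.1f" % elo_from_score(score))
print("95%% margin: +/- %.1f Elo" % (1.96 * stderr_score * slope))

With these assumptions it prints an Elo difference of about +7 with a margin of roughly +/-11, in line with the FAQ's conclusion that the measured difference is inside the error margin.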


Komodo 11 does recommend turning it off, but with really squishy language that implies it could be better to have it on under some circumstances:

We recommend running Komodo with Hyperthreading turned off on your computer, although this is debatable and may depend on your hardware. You can often find this in one of the boot up BIOS settings.

Houdini saying you can reach a higher clock speed with hyperthreading turned off is true, but it also misses the point, because you can get more instructions per second from the hyperthreaded computer at a given clock speed than you can with it turned off.

I overclock all my computers and can tell you that thermal dissipation is not the limiting factor.
Every individual CPU has internal voltage levels at which it will fail, and overclocking requires raising the internal voltages.
This is why it's only recommended for people who like to live on the edge.

This is my go-to cooler: https://www.arctic.ac/us_en/freezer-a30.html. It's quiet, reliable, and gets the job done. If that's not hardcore enough for you, there's water cooling or even liquid nitrogen.

Glad to hear AMD is making a comeback.
I bought a lot of their stuff before Intel kicked them to the curb.
Parent - - By sbm (*) Date 2017-12-07 20:23
Given that the processor I use for chess analysis does not have multiple threads per core, I might not be in a position to formulate an objective opinion regarding the x-cores / y-threads question. Nevertheless (and I do not dispute the performance gains in this regard), I am more inclined toward the standpoint that what matters in a specific situation - of course, I'm talking about chess-related aspects only - is not the achieved speed, but the ability to "solve" the task. So an authoritative and genuinely informative benchmark should focus not on plain speed (aka kN/s) but on task-solving ability.

Specifically: will a setting / accessory / fine-tuning help you in the final evaluation of a task? If so, to a really appreciable extent?

1. From the angle of hardware (of course here too some aspects are physical and some logical, but there is no need to go into such details....)
So, what influences the performance achieved? The generation of the processor, the instruction set, speed in general but mostly IPS, the size and speed of the cache, and of course the number of cores and... threads.

2. From the angle of software
Unfortunately, I have no information regarding the existence of true native Linux chess engines (except maybe Arasan...). Process/thread scheduling in Linux is configurable; there are various scheduling algorithms and some can be configured by recompiling the kernel. Therefore I think that an engine built specifically for Linux would have major benefits over a Windows version...
As far as Windows engines are concerned, the most important aspect is their version! This is what has a decisive effect on performance... a bit more on this later.
Of course, there are some fine-tuning settings regarding the Windows thread scheduler: priority class, CPU affinity, Windows dynamic thread priority boost, or modifying the I/O priority and the memory priority. Yes, these are important too, but in the majority of cases these settings mostly influence the responsiveness of the system where the engines are running.

Bottom line, what am I getting at?!
Well, all the explanations that come from the manufacturers regarding this cores/threads SMT issue focus on the fact that some parts of the processor sit idle while certain other parts (mostly the ALU or FPU) are occupied with a particular task of the ongoing thread. If this is the case, then it should also be true for processors that do not have SMT; moreover, it should be especially true for these. Even though those CPUs had no specific (multi-core) instruction set, no specific architecture and so on, it still has to hold... Of course the modern 2-threads-per-core CPUs have all the optimizations to squeeze out the maximum they can and get near-full utilization of the CPU hardware during pipeline stalls... after all, that is what multiple threads per core is all about.

Some might say: hey, to get SMT, all the optimizations made in the CPUs were logical, i.e. better reorganization of tasks! Well, almost true, but not quite! To achieve real benefits with SMT, Intel CPUs (the hyper-threading-capable ones) are almost 5% larger, to have space to implement some additional structures.

The logical processors have their own independent architecture state (that 5%...), but they share nearly all the physical execution and hardware resources of the processor. (I think the catch, chess-wise, is here!)
Among these is the instruction streaming buffer, which holds instruction bytes in preparation for the instruction decode stage.

A pretty interesting article:
https://www.pcper.com/reviews/Processors/AMD-Ryzen-and-Windows-10-Scheduler-No-Silver-Bullet

Btw, because the main problem with multithreading is the cache context, there might be situations where more threads (2) can thrash the cache when one thread would not... so the 2-thread job might run a bit slower overall than the 1-thread job.... Mostly not, but sometimes... yes. So, to see whether this is true or not, I made a little test. Very simple and straightforward. Every sequence was run for 1 minute and 10 seconds.... the extra 10 seconds just to make sure that initialization is done and no time is wasted...

Here I used Stockfish as the engine (first on a 4-core-only, non-SMT-capable CPU at 2.6 GHz).

CPU: AMD Phenom, 4 cores | Speed: 2.6 GHz | Engine: Stockfish 8 | Hash size: 2048 MB | Running: 1 min 10 sec

Threads  Depth  kNodes   kN/s   Notes
   1      25     46893    723
   2      26     91283   1362   something near linear in the increase
   3      28    139808   2072   higher watermark in achieved depth; increased total kNodes and higher speed too!
   4      29    177504   2492   the highest watermark in achieved depth! maybe the threshold to an even higher depth was near... speed and kNodes higher than before
   --- from now on there are no more free (well... so to say) cores ---
   5      27    171478   2497   smaller depth with decreased total kNodes, but somehow the same speed!
   6      27    175651   2549   depth still under the watermark, with increased total kNodes and speed!
   7      26    141725   2552   smaller depth and significantly decreased total kNodes, but even higher speed!?
   8      25    167182   2563   depth as small as with 1 thread... total kNodes somewhere around the 4-5 thread level, and the highest speed so far!

After each and every run I flushed the hash and unloaded the engine, to be sure that it starts from scratch.

Just for fun:
  32      23    177692   2833   even smaller depth, with the highest total kNodes so far and even higher speed!
 127      22    188886   3095   the smallest depth of all, with the highest total kNodes and the highest speed of all!

And this all links back to problem solving too.... smaller achieved depth = decreased problem-solving capability.
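For anyone who wants to repeat this kind of measurement, here is a minimal sketch of the thread-count run above using the python-chess library. The engine path and the 70-second budget are assumptions mirroring the description; starting a fresh engine process per setting stands in for "flushed the hash and unloaded the engine".

import chess
import chess.engine

ENGINE_PATH = "stockfish.exe"   # assumed path to the engine binary
RUN_SECONDS = 70                # 1 min 10 sec, as in the runs above

for threads in (1, 2, 3, 4, 5, 6, 7, 8):
    engine = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)
    engine.configure({"Threads": threads, "Hash": 2048})
    # Analyse the starting position for the full budget and read the final stats.
    info = engine.analyse(chess.Board(), chess.engine.Limit(time=RUN_SECONDS))
    engine.quit()
    print("%3d threads  depth %2d  %8d kNodes  %6d kN/s" % (
        threads,
        info.get("depth", 0),
        info.get("nodes", 0) // 1000,
        info.get("nps", 0) // 1000,
    ))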

I ran that 100-position test because I was very intrigued by the fact that the very same hardware system of Sedat's performed better when HT was OFF!
No speed measurement, just plain problem solving!

Solved  CPU                        GHz   HT   Threads  Arch  Tester
  65    2x Intel Xeon E5-2686 v3   2.00  OFF    32      x64  Sedat Canbaz
  63    2x Intel Xeon E5-2686 v3   2.00  ON     64      x64  Sedat Canbaz

  48    Intel Core i7-980X         3.33  OFF     6      x64  Sedat Canbaz
  46    Intel Core i7-980X         3.33  ON     12      x64  Sedat Canbaz

Well, I wanted to see this with my own eyes, so to say :-) So I ran this test on the one laptop that we have in our household; it is equipped with an i7 Haswell processor, 4 cores / 8 threads. First, the test was run with HT off, so 4 cores only: 33 solved out of 100. Second, the test was run with HT on, so 4 cores / 8 threads: 29 solved out of 100. During the test, 30 seconds are allocated for each position. For this I used the exact setup from Sedat's test suite... which uses Komodo 9.0.2 64-bit.

To sum up these tests:
Analyzing engine: Komodo-9.0.2-64bit 4 cores 4 threads(HT off)
Level: 30 Seconds/position
Total time spent : 41:21 = 2481 Seconds
Result: 33 solved out of 100 (the engine spent 67x30 seconds=2010 seconds with the unsolved positions, and 481 seconds on the 33 solved positions, avg. 14.5sec/position)

Analyzing engine: Komodo-9.0.2-64bit 4 cores 8 threads(HT on)
Level: 30 Seconds/position
Total time spent : 41:57 = 2517 Seconds
Result: 29 solved out of 100 (the engine spent 71x30 seconds=2130 seconds with the unsolved positions, and 387 seconds on the 29 solved positions, avg. 13.3sec/position )

It isn't some random issue... it does solve fewer positions!
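For reference, a run like this can also be scripted outside the GUI. Below is a minimal sketch using python-chess; the engine path and EPD file name are assumptions, and unlike the actual test tool it always spends the full 30 seconds per position rather than stopping early once the solution is found.

import time
import chess
import chess.engine

ENGINE_PATH = "komodo.exe"          # assumed path to the engine binary
EPD_FILE = "testsuite.epd"          # assumed file of positions with "bm" operations
SECONDS_PER_POSITION = 30

engine = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)
solved, total, time_spent = 0, 0, 0.0

with open(EPD_FILE) as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        board = chess.Board()
        ops = board.set_epd(line)   # parses the FEN part and the EPD operations
        if "bm" not in ops:
            continue
        total += 1
        best = None
        start = time.monotonic()
        # Let the engine think for the full budget and keep its current best move.
        with engine.analysis(board, chess.engine.Limit(time=SECONDS_PER_POSITION)) as analysis:
            for info in analysis:
                pv = info.get("pv")
                if pv:
                    best = pv[0]
        time_spent += time.monotonic() - start
        if best in ops["bm"]:
            solved += 1

engine.quit()
print("Result: %d solved out of %d, total time %.0f seconds" % (solved, total, time_spent))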

Then I went back to the "old", non-SMT-capable CPU computer and ran the test with two different versions of the engine.

Analyzing engine: Komodo-9.0.2-64bit 4 cores 4 threads
Level: 30 Seconds/position
Total time spent : 44:41 = 2681 Seconds
Result: 22 solved out of 100 (the engine spent 78x30 seconds=2340 seconds with the unsolved positions, and 341 seconds on the 22 solved positions, avg. 15.5sec/position)

Analyzing engine: Komodo-11.2.2-64bit 4 cores 4 threads
Level: 30 Seconds/position
Total time spent : 33:14 = 1994 Seconds!!
Result: 46 solved out of 100 (the engine spent 54x30 seconds=1620 seconds with the unsolved positions, and 374 seconds on the 46 solved positions, avg. 8.1sec/position)

Wow! The jump from engine version 9 to the latest that I have, the 11th, produced a huge performance leap!
46 solved out of 100, which is more than double the number of solved positions compared to the 22 solved by the older engine! And not only that, it did it with a huge gain in speed too!

Now I wanted to see the gains on the SMT-capable 4-cores / 8-threads CPU (and a more recent architecture too), again with this Komodo 11.2.2 version.

Analyzing engine: Komodo-11.2.2-64bit 4 cores 4 threads(HT off)
Level: 30 Seconds/position
Total time spent : 28:45 = 1725 Seconds
Result: 56 solved out of 100 (the engine spent 44x30 seconds=1320 seconds with the unsolved positions, and 405 seconds on the 56 solved positions, avg. 7.2sec/position)

Analyzing engine: Komodo-11.2.2-64bit 4 cores 8 threads(HT on)
Level: 30 Seconds/position
Total time spent : 30:36 = 1836 Seconds
Result: 54 solved out of 100 (the engine spent 46x30 seconds=1380 seconds with the unsolved positions, and 456 seconds on the 54 solved positions, avg. 8.4sec/position)

Of course here too there are huge gains when comparing the results of the two versions of the engine, but the result-oriented performance decrease between 4c4t and 4c8t is of about the same magnitude!

And since I had started making tests, I ran the same threads/depth/speed test on the Intel Haswell HT processor too:

CPU: Intel Haswell, 4 cores / 8 threads | Speed: 3.1 GHz | Engine: Stockfish 8 | Hash size: 2048 MB | Running: 1 min 10 sec

Threads  Depth  kNodes   kN/s
   1      28     84031   1262
   2      28    163093   2389
   3      29    213425   3344
   4      28    224114   4133
   5      28    297715   4764
   6      28    347130   5480
   7      28    330442   5838
   8      29    407909   5792

Some sort of conclusion might be:
- the increase in cores does not imply a linear increase in achieved depth, or in speed for that matter
- the gains in speed (engine-wise, not hardware-wise!) and in kNodes do not necessarily mean an increase in depth, i.e. a faster solution (room for optimization in the engine...)
- for the moment it seems that there might be far bigger reserves of potential in the software (development) than in the hardware
- the real difference between two otherwise identical CPUs where only one has SMT capability (if there were such CPUs these days... but there aren't; there are no Celerons in this category :-) ) is, in my opinion, the increased productivity when working with multiple tasks on multiple projects. Then you have a really big gain, because one can work on several projects at the same time and still keep the responsiveness of the system at a fine level. Otherwise there is no increase in the solving capability of an engine.
Parent - - By Ghengis-Kann (***) [us] Date 2017-12-07 22:18
Interesting work, but I'm a bit confused by how the results are being reported.

Taking less time to solve each position should result in more positions being solved in a given time.

Your results show more time used per solved position in cases where more positions are solved.
Parent - By sbm (*) Date 2017-12-08 09:34
Hm...now I'm the one who is confused :confused:

The engine has 30 seconds at its disposal to solve (or not) every position, regardless of how many positions it has already solved. Example: if there are only 2 positions to solve, the time is 30 sec for each. If only one position is solved (first or second does not matter), and it is solved in, say, 1 second, then the whole test will end in 1+30 = 31 seconds. If both positions are solved, one in 1 second and the other in, say, 14, the whole test will end in 1+14 = 15 seconds. If neither is solved, then the whole test will run for 30+30 seconds. But if both are solved right at the limit, in exactly 30 seconds, then the test will also end in 60 seconds, the same time as when it does not solve a single position.
Parent - By pawnslinger (***) [us] Date 2017-12-07 22:45
I appreciate your analysis.  But I find it hard to generalize a conclusion from your testing.  It has been my experience that SMT/HT has changed over time.  In other words, 8th gen Intel Core processors have a more efficient implementation of all aspects of the cpu, including how SMT/HT is achieved.  So I would expect much better performance from an i7-8700, than say an i7-2600.  But this is just my feeling, nothing that I can even test.  When I upgraded from a 2nd gen i7-920 to an R7-1700, I noticed a lot of improvement in system efficiency -- I believe both in single and multi threaded work loads.  However, this is very hard to be objective about.

I have also noticed that some software handles SMT/HT better than others...  for example, using more than the physical core count with Chessbase tends to make my system less responsive.  But this same issue does not happen with Aquarium.  In fact, with Aquarium I am able to use ALL cores (real and virtual) and see Windows Task manager at 100%, and my system still remains very responsive.  Why?  What is the difference between the 2 programs?  With Chessbase I keep it running with 6 or less cores (my system has 8 physical cores).  With Aquarium I am able to use all 16 cores and still maintain system responsiveness to do other things (like browse the internet while Aquarium bangs out work in the background).

So in conclusion, I think it is very hard to generalize from the limited testing that has been done.
