Rybka Chess Community Forum
Rybka Support & Discussion / Aquarium / Aquarium2020 on Threadripper 3990x
- - By centipawn (*) Date 2020-02-10 16:09 Edited 2020-02-10 16:11
To those of you wondering what Aquarium will do on a Threadripper 3990x: I am lucky enough to have one (due to work - not entirely to have fun with chess, sadly). I configured it with Stockfish 11, 32 MB Hash per instance, 1 thread per instance, for IDeA.

I previously used Aquarium in a similar configuration on a 16 core Threadripper with 24 instances, and 16 additional remote instances. This worked fine and was fairly responsive.

On the 3990x, which has 64 cores, I find Aquarium gets fairly unresponsive in the GUI. It works, and will use all configured instances (I tried with 16, 32, 64, 100). But beyond 16, the GUI really lags (seconds), which is weird. I tried upping the priority of the GUI process to 'realtime', but even that did not help. I'm not clear currently what the bottleneck is, but it is certainly bad enough to take the interactivity out of IDeA.

The odd thing is: even when only 4 of the engines are busy, the GUI becomes pretty unresponsive. Something is amiss here... will investigate.

Having said this - this report is just the result of 4 hours of fiddling. Just wanted to put it out there as I imagine some of you may consider upgrading to the 64 core beast.
Parent - By pawnslinger (****) Date 2020-02-10 21:19
I noticed this on my system too... when I first moved to Zen type chips.

I had to up my depth of analysis, because the chip was cranking out TOO MANY evals per second.... the backend couldn't keep up and so the GUI slowed down.  I would increase my depth of analysis by 25 percent and see if that helps.  You can always fine tune for your hardware.

By the way, my backend runs to a fast SSD, so that wasn't the problem.  Just in case anyone was wondering.  And my system has plenty of ram at 3200 mhz.  In order to keep things smooth, I had to add about 40 percent to my analysis depth (Zen is fast and the 3990X appears to be the fastest).
Parent - By mattchess (**) Date 2020-02-11 01:31 Edited 2020-02-11 03:57
Some day maybe...some day :P  Quite envious!  Keep us posted on your optimization
- - By centipawn (*) Date 2020-02-11 14:52
After some more fiddling with it, a little update:

Firstly, the lack of responsiveness only happens for Aquarium. Even with 120 Stockfishes calculating away, other programs remain fully responsive with no discernible lag.

Secondly, Aquarium has a serious problem in this setup. I repeatedly ended in a situation where an IDeA project would show one or two tasks in progress (currently I have one that it says has been running for over 90 minutes - my setting is a max time of 60 seconds), but does not show any engine busy (nor is there, as far as I can make out), and that is where it never recovers, effectively stopping completely (waiting forever for this task). This type of situation happens sooner with more engines configured - with 120 engines configured, it will happen within 2 hours usually.

I can only make a wild guess at this point: maybe Aquarium has a bug in the threading / thread synchronization (more likely) and/or communication logic with UCI subprocesses (less likely) causing it to occasionally disregard a task that is finished (or should be stopped due to time overrun). Having said this, I am a little surprised that this never showed on a 16-core machine with 24 local and 16 remote engines configured. Odd. Disk or memory speed/size are definitely not the issue here (3200 MHz CL 14, NVMe - Samsung Evo Pro).

However, at present this makes the 3990x effectively useless for Aquarium. It is lightning fast if you don't mind babysitting it to restart when it hangs, and the UI locking up. If I want to avoid either, I have to dial down the engine setting to such a low number that the 16-core machine is actually faster.
Parent - - By Ghengis-Kann (***) Date 2020-02-11 17:49 Edited 2020-02-11 17:54
There is something strange going on here.

I am running Aquarium just fine on a 3950X, which has exactly the same architecture as far as I know except it only has 2 of the 8 core chiplets instead of 8 that the 3990X has. (edited to add that I am aware of other differences like the number of PCI express lanes but nothing relevant to running Aquarium).

There are different versions of Stockfish (popcnt, and some others that I can't immediately remember); my guess is you are running the wrong one.

I also recall from the Komodo readme that there is a particular instruction that is not well executed by the new ryzen chips.

Here it is:
NOTE: On AMD Ryzen CPUs you should not use the BMI2 version of Komodo. The Ryzen implementation of the PEXT instruction used in the BMI2 version is very slow, so please do not use the BMI2 version on those machines.
Parent - - By centipawn (*) Date 2020-02-11 18:24
It's not that. Stockfish does just fine (and yes, I am using the BMI2 version with the PEXT). My "normal" PC is a Threadripper 2950X with 16 cores (similar to your 3950X, only previous generation) and Aquarium does fine there with the exact same executable of Stockfish.

If it was Stockfish related, then performance would just be bad, but that's no reason why the GUI of Aquarium would become unresponsive.
Parent - - By Carl Bicknell (*****) Date 2020-03-26 21:40

> It's not that. Stockfish does just fine (and yes, I am using the BMI2 version with the PEXT)


I thought with Zen you should use popcount
Parent - - By centipawn (*) Date 2020-03-27 06:40
Indeed, and I have switched to that since!
Parent - - By Carl Bicknell (*****) Date 2020-03-27 10:28

> Indeed, and I have switched to that since!

And did that solve the problem?

On another forum someone wrote: "BMI does horrible things to Zen..."
Parent - - By centipawn (*) Date 2020-03-27 13:50
It did not solve any problem, it just runs faster.
The main problem is still that Aquarium fails to realize a task is done when I configure e.g. 60 engines with 1 thread each.
I am now running 30 engines with 4 threads each. Not ideal, but at least stable.

<joking> Maybe a high BMI is bad for karma... not sure what Zen has to say about that ;-) </joking>
Parent - - By Carl Bicknell (*****) Date 2020-03-27 22:49
This is very strange because I've had around 100 engines running (one thread per instance) and many of those were remote - without issues. It does sound like something is wrong.

Are you analysing at a fast TC?
Parent - - By centipawn (*) Date 2020-03-28 10:28
No. The absolute minimum depth I use is 30, and mostly, I use higher depth. I don't think this counts as a fast TC.

It works for a while, but eventually, it will hang because it keeps waiting for a task that did finish. But it is very odd that this does not happen on other CPUs, even when using many more engines. Interestingly, it also happens if I run Aquarium on the 3990x, but all engines are remote. Weirdly, it also happens if I run Aquarium on a different PC, but the remote engines on the 3990x. I work in IT and know quite a bit about multithread / multiprocess / distributed programming etc., but I can't come up with a theory that would explain all these facts.
Parent - By pawnslinger (****) Date 2020-03-28 16:50 Edited 2020-03-28 16:53
This is a classic "deadly embrace" situation.  The pipe from the engine to the Aquarium instance somehow is losing some of the engine output.  The cause of the loss is a mystery, it could be any number of things, so it is hard to troubleshoot, especially because it only happens sporadically (symptom of classic "deadly embrace" situation).  Once the contents of the pipe are somehow dropped, Aquarium is left waiting for an event that has already passed it by.

As a retired programmer that has dealt with this class of problem, I know this is probably first and foremost a bug in Aquarium... the waiting code should have a timeout to break any possible deadly embrace... Aquarium clearly has no such code -- therefore a bug.

It is harder to say which program is responsible for the pipe losing the engine output... there are lots of potential areas of interest.  But I suspect that it is a Windows problem.  I suspect Windows' ability to buffer these pipes is being exceeded... resulting in another classic bug... the "buffer overflow".  Buffer Overflow bugs are notorious for crashing systems and Windows has had many many of them... and they still exist today, usually they are exploited by hackers to gain illicit entry to systems or to crash them.  The first time I ran into such a bug was pre-Windows, in another Microsoft product, the Basic interpreter.  Microsoft truly has a blind spot for this type of bug.  Once triggered, the buffer overflow in Basic would dump the user out to the system prompt, giving potentially unauthorized system level access.

The first thing I would think of to try and stop this would be to find a way to increase the size of the buffers in question.  But hackers know that as long as the buffer is finite, it can be overloaded.  So a permanent fix involves changing how the buffer is managed (a task only Microsoft can easily do).  When confronted with this sort of problem the fix has been to rewrite the code in such a way that it doesn't use any of the Microsoft buffer handlers.  Not very practical in this case, since none of us are Aquarium programmers (at least I don't know of any).
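A minimal Python sketch of the timeout idea from the post above. It assumes a reader thread that drains the engine's stdout continuously (so the pipe cannot fill up) and a waiting loop that gives up after a deadline instead of waiting forever. The function names `drain` and `wait_for_bestmove` are hypothetical, an `io.StringIO` stands in for a real engine pipe, and none of this is Aquarium's actual code.

```python
import io
import queue
import threading
import time


def drain(stream, lines):
    """Read every line from the engine's stdout and hand it to a queue."""
    for line in stream:
        lines.put(line.rstrip("\n"))
    lines.put(None)  # sentinel: the stream was closed


def wait_for_bestmove(lines, timeout_s):
    """Return the 'bestmove' line, or None if it does not arrive in time."""
    deadline = time.monotonic() + timeout_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return None  # deadline passed: stop waiting
        try:
            line = lines.get(timeout=remaining)
        except queue.Empty:
            return None  # nothing arrived before the deadline
        if line is None:
            return None  # engine exited without answering
        if line.startswith("bestmove"):
            return line


# Stand-in for an engine pipe; with a real engine this would be the
# stdout of a subprocess.Popen object opened in text mode.
fake_stdout = io.StringIO("info depth 30 score cp 12\nbestmove e2e4\n")
lines = queue.Queue()
threading.Thread(target=drain, args=(fake_stdout, lines), daemon=True).start()
print(wait_for_bestmove(lines, timeout_s=1.0))  # bestmove e2e4
```

With this shape, a task whose answer never arrives is reported as `None` after the deadline, and the caller can restart the engine or reschedule the task.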
Parent - - By pawnslinger (****) Date 2020-03-28 17:20
I just did some more digging... it seems that stdout defaults to a 4kb buffer size (I don't know how to change it).  So is there some way to change the verbosity level of the engine??  Maybe the engine is occasionally too verbose and the buffer overflows.. 4kb doesn't seem like a lot of buffer space to me.

I bet the UCI protocol allows some sort of verbosity control... whether it can be accessed from Aquarium is another question entirely.
Parent - - By centipawn (*) Date 2020-03-29 06:50
You could be right, Pawnslinger. But that still makes it an Aquarium problem.

I wrote a little utility for my own usage that basically stores a chess tree in a database and runs engines, similar in that aspect to what IDeA does (but without GUI and a simpler algorithm for prolongation and alternatives - more a research project than something for actual CC usage at the moment). It can cope with 125 engines on the 3990x and an additional 100 remote engines, no problem. So it is not an intrinsic problem of the platform.
Parent - By pawnslinger (****) Date 2020-03-29 07:50
Yes, I agree.  An Aquarium bug.  But since many of us wish to use Aquarium, and we cannot control the Aquarium algorithm... we (Aquarium users) are forced to find workarounds.  Even just with a 1700x, 8 cores, I myself run into this problem - about once a month on average.  So it is not a huge problem for me now, but I am planning to upgrade my cpu in the near future, so the problem may become more of an issue.
Parent - - By Carl Bicknell (*****) Date 2020-03-28 22:42
Very strange.

Have you tried clearing out all the cache from every conceivable place in Aquarium - I.A etc etc ?
I'd be very interested to see if this happened with a clean install of windows and a fresh install of Aquarium with no previous trees. If it does there's nothing you can do.
Parent - By centipawn (*) Date 2020-03-29 06:51
Yes, I have tried that. My 3990x was brand new and Aquarium was the first thing I ran on it after installing Windows. And yes, it was a completely fresh Aquarium install too. I have reinstalled Aquarium a few times from scratch to test various scenarios as well. Problem always happens. It does not happen immediately, but it always did within 6 hours.
Parent - - By pawnslinger (****) Date 2020-02-11 18:34 Edited 2020-02-11 18:57
You are having the EXACT problem that I had.  Yes, exact.  The slowdown only affects the Aquarium GUI - NOT any other program running on my machine.  They all run fine, even when the Aquarium is sluggish as heck.

Did you try my suggestion?

Give it a try.  With 120 engine instances running, I am sure that you are funneling too much too fast into the Aquarium backend.  You either have to cut back the number of engines or (sadly) find some way to slow them down.  Ironic is it not?!  We all want faster CPUs, but Aquarium simply cannot handle the quantity being produced.

I don't know what depth you are using, but for the sake of discussion, let's assume you are using ply depth 25... increase it to 38 ply.  That should give you much higher quality and much slower speeds.

And that is the benefit of using a 3990X !!!    A rather huge leap in Quality.

There is a way to use such a massively parallel CPU without slowing things down (sounds odd to say that).  And the way you can do that is to run multiple copies of Aquarium.  On my system I can run 5 copies of Aquarium at faster speeds.  Ironic?  Yes!  So you could run 10 Aquariums giving 12 engine instances to each and you might be okay.  In this way you are still able to run 120 Stockfish instances, but spread out over 10 Aquarium backends.  I mean, I wouldn't do that myself, I would just take my higher quality and go for my GM title.
Parent - - By centipawn (*) Date 2020-02-11 19:29 Edited 2020-02-11 19:31
Hi Slinger of Pawns, I did read your suggestion in this forum. My "usual" setting is a depth of 30 AND minimum time of 60 seconds, with a maximum of 120. I will try increasing it and see what that does.

(edit to insert the following:) What makes me doubt your theory is that even when starting IDeA with my settings, the GUI is terribly slow from the first second - even though the first results will not come in for another 50-odd seconds.

Ultimately, I aim to replace Aquarium with a little piece of custom software I am working on. Far from finished - but I did already run a little test harness with 120 instances of Stockfish working away and logging results (not into a proper DB so far). There's no problem there, it is as fast as you would expect. So it's definitely not a Stockfish problem.
Parent - By Ghengis-Kann (***) Date 2020-02-11 20:24
Have you tried running remote engines on the 3990X with RTHomeServer406?
If it is a GUI problem then you can still harness the full power of it by running Aquarium on a different machine.

Not an entirely hypothetical question for me because I plan to go that way in the not too distant future, but will wait for the 4990X to get the cache improvement and the added efficiency from the whole thing being built at 7 nm.
Parent - - By pawnslinger (****) Date 2020-02-11 20:26 Edited 2020-02-11 20:29
Yes.  I agree -not a Stockfish problem.

A ply depth of 30 is pretty good, but in your setup, I think it is still too low.  If you have Stockfish producing 120 evals per minute (120 instances of Stockfish each producing 1 per minute), that would be 7200 per hour, if my math is correct.  In my opinion, you want to get that below 1000 per hour.  That is the maximum I use anyhow.

In actual usage, I keep it way way under 1000 per hour.  I cannot imagine the Aquarium backend handling 7200 per hour.  I wish it could.
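The arithmetic above can be checked with a few lines of Python. This is only a sketch: it assumes every instance is busy all the time and finishes one evaluation per fixed interval, which is an upper bound rather than a measurement. The function names are made up for illustration.

```python
def evals_per_hour(instances, seconds_per_eval):
    """Upper bound on evaluations per hour if every instance is always busy."""
    return instances * 3600 / seconds_per_eval


def seconds_per_eval_for(instances, target_per_hour):
    """Seconds each evaluation must take to stay at the target hourly rate."""
    return instances * 3600 / target_per_hour


print(evals_per_hour(120, 60))          # 7200.0
print(seconds_per_eval_for(120, 1000))  # 432.0
```

Under these assumptions, 120 instances would each need a bit over 7 minutes per evaluation to stay at 1000 per hour.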

If you are writing the results of your own Stockfish handler to a flat file... yeah, it can handle 7200 per hour.  A flat file is much different than a db backend - which has a lot more overhead involved.  I wish I knew how to calculate the exact number of transactions that a Zen cpu can handle.  I expect there are lots of variables, like memory speed, memory size, buffer sizes, disc speeds, etc, etc, etc.  And I think Ghengis may be wrong about the PCIe not making a difference... if your backend is going out to NVMe, then the speed of your PCIe will definitely figure into the formula.  And of course, then there is the coding in Aquarium itself, which I suspect has not been developed with hugely parallel cpu's in mind.  Nor with very fast cpu's in mind either.

In my opinion, the development (which is no longer being done) of Aquarium was done to take advantage of the least common denominator cpu, think i7- four core 3rd generation Intel, and I think you would be close to the mark.  Zen with 64 cores is way way beyond anything that Aquarium was developed to handle.
Parent - - By Ghengis-Kann (***) Date 2020-02-11 20:55
PCIe speed is the same for a 3950X vs 3990X, but the Threadrippers have more available lanes.
In any case chess analysis does not appear to be IO bound, but rather a direct function of clock speed and instructions per clock.

My Xeon 2650 server is i5 level technology with 16 cores, 2.6 GHz clock speed, DDR3 1600 and PCIe3. The 3950X runs 3.7 GHz on all 16 cores, has DDR4 3200 and PCIe4.

My engine balancing experiments determined that 5 threads of the Xeon equals 3 threads of the Ryzen, which is completely accountable by the higher clock speed and improvements in instructions per clock. The much faster memory and bus speeds are contributing essentially nothing.
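As a rough check of that claim in Python: if per-thread throughput is modeled as clock speed times instructions per clock (a simplification, and an assumption of this sketch), the measured 5-to-3 thread ratio and the quoted clock speeds imply the following IPC gain. The function name is hypothetical.

```python
def implied_ipc_gain(thread_ratio, slow_ghz, fast_ghz):
    """IPC improvement needed to explain thread_ratio beyond the clock difference."""
    clock_ratio = fast_ghz / slow_ghz
    return thread_ratio / clock_ratio


# 5 Xeon threads ~ 3 Ryzen threads; 2.6 GHz Xeon vs 3.7 GHz Ryzen (from the post)
print(round(3.7 / 2.6, 2))                         # 1.42
print(round(implied_ipc_gain(5 / 3, 2.6, 3.7), 2))  # 1.17
```

So under this model the clock accounts for about a 1.42x difference and the remaining factor of roughly 1.17 would be the IPC improvement.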
Parent - - By pawnslinger (****) Date 2020-02-11 21:14
You forgot to add... "in your case".

I think that a more efficient cpu can bottleneck lots of different things.  Including the PCIe bus.  What happens with lower cpu's (like mine) does not always translate well to higher end cpu's.

On my cpu (1700X) I cannot bottleneck my PCIe bus.  So you are right.  On some other setup with a faster cpu, and faster NVMe, a slow PCIe bus could possibly bottleneck.  Especially depending on the other components in the system.  It is very hard to make a generalization, when I don't have his system to test it on.
Parent - - By Ghengis-Kann (***) Date 2020-02-11 21:27
This type of analysis will never saturate a PCI Express bus.

It's like driving a donkey cart down an 8 lane highway.
Parent - By pawnslinger (****) Date 2020-02-11 23:32
You know, you might be right.

But in my system, Chess analysis is seldom all that is going on.  I don't even begin to know how to test whether the PCIe bus is bottlenecked or not.  I do know that the general view is that modern video games, even using high-end GPUs, do not bottleneck it.  And they use a lot more bandwidth than Chess analysis.

However, there must be a reason why they developed PCIe 4.0!?  If PCIe is not a bottleneck, why jump to 4.   Oh yeah, I guess it must be because more and more things are hanging off that bus everyday.  That could be it.

So you could be right.  But I think it must be a case by case analysis.  Depends on the system and what else might be going on in it.  If just Aquarium and Stockfish, I am positive you are right.  But it doesn't happen that way in my system.  For starters, I run a video server on this machine, Serviio... it is almost constantly playing videos for someone in my home.  That runs in the background 24/7.  Then there is whatever I am doing personally (this is my workstation), the Aquarium analysis just hangs around in the background most of the time... I play videos and normal games in the foreground (I am retired, lots of time on my hands).  And this browser has 3 tabs open right now, usually YouTube and news type sites.  And all runs smoothly even if Aquarium is bogged down... so I am sure that the PCIe bus on my system is more than adequate.

So yes, you are probably correct... in my case.... and in yours.  But I haven't a clue about "centipawn" and his system (except that it is a really fast system).

You know a funny thing (I like to say that <grin>), I started having this type of GUI issue with Chessbase a couple of months ago.  I Google'd around and found that I was not the only one with such CB issues.  I started having the problems after installing Eman.  Turns out I had to tweak a few things in CB to get it back on the smooth track.  So Aquarium is not alone with this issue.
Parent - By cma6 (****) Date 2020-03-28 02:55 Edited 2020-03-28 03:01
"My engine balancing experiments determined that 5 threads of the Xeon equals 3 threads of the Ryzen, which is completely accountable by the higher clock speed and improvements in instructions per clock. The much faster memory and bus speeds are contributing essentially nothing."

   Ghengis: Does this imply that my dual Xeon system with 2 X 18= 36 cores would be equivalent (on a Ryzen 3900 series) to 3 X 36/5 cores or 21.6 Ryzen cores, in which case a Ryzen with more than 22 cores would be faster for chess than such a dual-Xeon system?

" will wait for the 4990X to get the cache improvement and the added efficiency from the whole thing being built at 7 nm."
    I can't find any info on TR 4990X. What have you heard in terms of cores, etc?
Parent - By pawnslinger (****) Date 2020-02-11 20:36
I know this sounds crazy... but the 3990X is totally unprecedented on the desktop.

According to my rough calculation (very rough) I think you should try to raise your depth from 30 to 45 ply.  That will slow it down enough, I think.

You can always do fine tuning.
Parent - By pawnslinger (****) Date 2020-02-11 20:42
The really funny thing... 45 ply may not be enough!!!!  Your cpu might be able to reach depth 45 in 60 seconds.... in which case you would still overload Aquarium, IMHO.  You should try to slow it down so that 120 Stockfish are producing 1 eval in about 6 minutes (or something like that).  Even at 6 minutes per eval, you still would be over 1000 per hour.  Man oh man, that is the kind of problem I would love to have.
Parent - By pawnslinger (****) Date 2020-03-27 06:05
Another thing that just popped into mind... Aquarium can be slow from time to time, especially if you have a large event queue and/or a large number of IDeA projects (or large projects).... because these files get backed-up at various times.  And a large event queue is a real slow down, as it is a "flat" file, so it takes a lot to add/remove entries from it.  I used to go as high as 50k entries in the queue, and it was a real slow down for everything, even the GUI update/mouse clicks etc.
Parent - By cma6 (****) Date 2020-03-28 03:13
   When doing infinite analysis, I see on Ipman chess that all the speedsters with great TR 3900-series results use threads = 2 X no. of physical cores (and SF popcount). However, that is for infinite analysis.

  I was under the impression that when using Aquarium/IDeA, one should not use threads but only physical cores, e.g., on my dual-Xeon master system with 2 X 18 = 36 cores + another 20 cores available on LAN, I can comfortably use 55 of the 56 cores (at least overnight) in IDeA without any problems, leaving only one core free on the master system for Aqr/Win10. Why should Ryzen TR chips be any different than a Xeon? Perhaps Aqr/IDeA is happy only with physical cores but not with extra threads?
- - By centipawn (*) Date 2020-02-12 08:19
Wow, quite a discussion here - I will try to provide feedback / comments on all suggestions made in this consolidated post, rather than sprinkle 15 replies over the board.

Regarding "use higher depth or Aquarium can't cope": No, does not help. I tried it overnight with 120 Stockfish instances, minimum time 600, maximum time 900, depth 42, which should - due to minimum time 10 minutes - ensure that no more than 1 result every 5 seconds comes in (on average). It got stuck fairly quickly, again waiting for the last task in the queue to finish - except that was not running anymore. After that, no progress whatever happens until I kill it. Note that on my 16 core machine, I had about 1 result every 3-4 seconds come in with no problem.

I also tried activating 9 projects, with 112 Stockfishes configured in 7 blocks of 16 each (just to see if that makes a difference). Just after activating IDeA, the GUI freezes, and it took 7 minutes for it to react to a button press (on Stop IDeA).

Remote engines - yes, that crossed my mind too. Main reason I have not tried it yet is that I have to figure out first how to automate setup, as manually configuring 120 RtHomeServers & 120 remote UCIs in Aquarium is about as much fun as having your tongue stapled to the floor.

Re: Database speed, bottlenecks, PCI throughput etc. I have quite some experience with all that (more than 20 years in IT now), and I share the opinion that Aquarium was never built for this kind of usage (how could it have been, back then - it clearly has not been modernized thoroughly in recent years). What we are observing is very far from the actual limits hardware and software impose these days. I have seen several databases holding billions of records and taking in tens to hundreds of thousands of records every hour (financial market tick information is one example). Desktop installations of SQL DBs such as PostgreSQL, some noSQL-DBs such as Mongo, and certainly specialized databases (many exist for fast logging) can easily consume hundreds to thousands of records per second with little sweat if configured and used wisely. PCI lane numbers and speed are of little concern here. The main reason why PCI 4 exists is because when modern video games start up, they need to transfer a great deal of data (GBs, typically) to the GPU's memory, which causes a delay in game startup. This is where these transfer speeds matter. For chess - even for LC0 - they are irrelevant. In fact, for chess, CPU is the only thing that matters. I did wonder if, with 120+ engines running, memory bandwidth for hashing would become an issue. Based on my measurements, it is not - I see 100% CPU utilization. If memory was throttling computations, this figure would be less.
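To make "configured and used wisely" a little more concrete, here is a small Python sketch that inserts a batch of engine results in a single transaction. SQLite is used purely as a stand-in because it ships with Python; it is not one of the databases named above, the table layout is invented for illustration, and an in-memory database says nothing about real disk performance.

```python
import sqlite3
import time


def insert_results(conn, rows):
    """Insert many (fen, depth, score_cp) rows inside a single transaction."""
    with conn:  # one commit for the whole batch instead of one per row
        conn.executemany(
            "INSERT INTO evals (fen, depth, score_cp) VALUES (?, ?, ?)", rows
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE evals (fen TEXT, depth INTEGER, score_cp INTEGER)")

# Placeholder rows; real rows would carry actual positions and scores.
rows = [(f"position-{i}", 30, i % 50) for i in range(10_000)]

start = time.perf_counter()
insert_results(conn, rows)
elapsed = time.perf_counter() - start

count = conn.execute("SELECT COUNT(*) FROM evals").fetchone()[0]
print(f"{count} rows inserted in {elapsed:.3f} s")
```

The design point is batching: committing once per batch rather than once per record is usually what separates a slow ingest path from a fast one.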

To elaborate further on the DB issue. With an Aquarium style use case, the challenge is not really taking in the data as fast as an engine can produce it. The challenge is how to organize that data so that minimaxing is efficient, threefold repetitions in move sequences are easy to detect, and (ideally) the 50-move draw rule can be observed, all the while ensuring that transpositions don't result in duplicated work. There are several ways to tackle this, but none are trivial. Regardless, I think with modern hardware and software it will certainly be possible to handle at least tens of results coming in per second (I think the hard limit is much higher). Again - not a criticism of Aquarium, which was designed when the kind of parallelism we have today did not exist.
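One common way to approach the transposition and repetition part, sketched in Python under stated assumptions: positions are identified by the first four FEN fields (pieces, side to move, castling rights, en passant square), so the same position reached by different move orders maps to one key. The helper names are hypothetical, and this is not a description of how Aquarium or the author's own tool stores its tree.

```python
from collections import Counter


def position_key(fen):
    """Keep piece placement, side to move, castling and en passant only.

    The halfmove clock and move number are dropped, so the 50-move rule
    has to be tracked per line of play rather than per stored position.
    """
    return " ".join(fen.split()[:4])


def has_threefold(fens_along_line):
    """True if any position occurs at least three times in the sequence."""
    counts = Counter(position_key(f) for f in fens_along_line)
    return any(n >= 3 for n in counts.values())


start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
# Same position after 1.Nf3 Nf6 2.Ng1 Ng8: only the counters differ.
again = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 4 3"
print(position_key(start) == position_key(again))  # True
```

The trade-off is visible in the docstring: merging transpositions saves duplicated engine work, but draw rules that depend on move history then need separate bookkeeping.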

Whether to increase the depth for the engine, or to use a shallow depth with a much deeper tree, is probably a matter of opinion (and I have not settled mine yet). Personally, in my own software, I want to go for a model that uses at least two different engines and adaptive depth. Basically, one of the rules will be: if all different engines agree fairly quickly that the move is bad, accept that it is. If, say, Stockfish says it is +0.9, and Komodo says it is -0.1, then keep increasing tree depth until it is clear which one is wrong. Evaluation depth will also be adaptive. My main gripe with just increasing Stockfish depth is that it will sometimes not look at moves that I want it to look at (often the type of moves that make a game interesting and turn out not to be bad at all - and usually those I add manually in interactive analysis. 99% of the time the engines will tell me I made a blunder, but the remaining 1% have turned out to be moves that can be decisive).
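The agreement rule could look roughly like this in Python. It is a sketch only: the thresholds `bad_below` and `max_spread` and the three action names are invented for illustration, and scores are assumed to be in pawn units from the same side's point of view.

```python
def next_action(scores, bad_below=-0.5, max_spread=0.5):
    """Decide what to do with a move given scores from several engines.

    scores maps an engine name to its evaluation in pawns.
    """
    values = list(scores.values())
    spread = max(values) - min(values)
    if spread > max_spread:
        return "deepen"  # engines disagree: keep extending the tree
    if max(values) < bad_below:
        return "reject"  # every engine agrees the move is bad
    return "keep"


print(next_action({"Stockfish": 0.9, "Komodo": -0.1}))   # deepen
print(next_action({"Stockfish": -1.2, "Komodo": -1.0}))  # reject
```

The first call reproduces the +0.9 versus -0.1 example from the text: a spread of a full pawn exceeds the assumed tolerance, so the position is searched deeper rather than trusted.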

However - to put this in proportion: I am by no means a chess expert. I suck OTB as I will blunder frequently, and have only started playing correspondence fairly recently. As far as software/hardware is concerned, I usually know what I am talking about, but as far as chess is concerned, take my thoughts with a grain of salt (or three).
Parent - - By pawnslinger (****) Date 2020-02-12 09:20
Interesting.  I am surprised that running multiple instances of Aquarium didn't help.  This makes me think that some sort of Windows issue is being pushed to a limit.

I know when I started running multiple copies of Aquarium, I had to set up separate folders for each Aquarium instance's engines.  When I tried running off 1 engine folder, things were very sluggish.  So in my setup I have:


Repeat N times for B, C, D, E etc....

So I have N copies of Stockfish, Eman, etc.  N copies of Aquarium, etc.  This is one reason I do not use engines with copy-protection, I ran into lots of trouble with them.  Especially Houdini... it did not like me running 5 copies of it AT ALL.
Parent - - By centipawn (*) Date 2020-02-12 09:31
Sorry pawnslinger, I should have explained better what I meant with 7 blocks of 16 engines. I did NOT try multiple Aquarium instances (because I don't find that particularly helpful for my usage).

What I did was this: I set up 7 local UCI engines all pointing to the same Stockfish executable. I configured all 7 for IDeA usage, each with a multiplicity factor of 16. I did this to see if it would make a difference - Aquarium juggling 16 instances each of 7 "separate" engines, or juggling 112 engines of one and the same. It doesn't. It was just a sanity check - it could conceivably have made a difference if synchronization in Aquarium was bound to the configured UCI engine(s). I'm not too surprised that this is not the case.
Parent - - By pawnslinger (****) Date 2020-02-12 10:11
I am not sure I understand what you did, but I think you wrote that you setup 7 engines using the same executable?  I think that wouldn't work, if my experience is anything to go by... I tried something similar when I began with Zen, and it appeared to me that Windows was only loading 1 copy of the executable into memory.  I could be wrong, but that is the way it looked to me.  That is why I broke it out into separate folders like that... this forced Windows to treat each as a different executable.  For whatever reason, with only 1 copy of Stockfish being in memory for 14 engines (that's my maximum engine count I use) Stockfish was getting only about 100 knps.  Once I split them out, running 14 engines I get 1200 knps per engine.  Quite a difference.  At first I didn't realize the problem, but thru trial and error, I worked it out.  So I now use the setup you see above with one extra quirk... since I use 14 Stockfish copies per Aquarium instance... I actually have a grand total of 70 Stockfish copies on my SSD, each with their own folder.  Needless to say, it takes me awhile when I want to upgrade Stockfish <grin>.
Parent - - By centipawn (*) Date 2020-02-12 10:31
Normally, I configure Stockfish as one UCI engine on Aquarium's "Engines" view, with a setting of 1 thread. In IDeA view, using the "Engines" button at the top, I configure this Stockfish engine to be used for IDeA, and set a multiplier - up to 120 in my tests.
What I did differently yesterday was that I configured seven UCI Engines for Stockfish in the "Engines" view, and in the "Engines" dialog of IDeA view, I added them all, each with a multiplier of 16.
All these UCI engines pointed to the same executable, all in the same location. I have never had problems with that for Stockfish.

On my 16 core machine, I usually have 24 engines configured for IDeA, but I have two more Stockfish(es?) set up as UCI engines for engine matches - also pointing to the same executable, in the same disk location. I am able to run engine matches while IDeA is working away with this setup, and it has never caused a problem.

This may not be true for other engines. It certainly won't work for engines that use their storage location to write data to (persistent hashes or similar), or that lock any files they read from there.

Windows will still start a separate process for each Stockfish instance Aquarium fires up, i.e. there will be separate copies in RAM, each with their own working data, regardless if they come from the same disk location or not (this is a slight simplification - for any DLLs referenced by the EXE, Windows will only keep one copy of executable code in memory, but any working data will be duplicated and separate for each process, so this does not matter in practice). I'm not clear why you encountered performance differences when branching out into folders - it sounds odd to me.
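A quick way to see the "separate process per instance" behaviour is to start several processes from one executable path and compare their process IDs. The Python sketch below uses the Python interpreter itself as a stand-in for the engine binary, since no engine path can be assumed here; `launch_copies` is a hypothetical helper.

```python
import subprocess
import sys


def launch_copies(executable, args, count):
    """Start `count` independent processes from the same executable path."""
    return [subprocess.Popen([executable, *args]) for _ in range(count)]


# The interpreter stands in for an engine executable.
procs = launch_copies(sys.executable, ["-c", "pass"], 3)
pids = [p.pid for p in procs]
for p in procs:
    p.wait()
print(len(set(pids)) == 3)  # True: three distinct processes
```

Each entry in `procs` is its own operating system process with its own working memory, even though all three were started from a single file on disk.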

Incidentally, PowerShell can help you automate things like "copy this exe to those 70 folders". I use it too seldom to know offhand how to do that - I have to google it every time, so I resort to hand copying for anything < 20 regardless...
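
A minimal sketch of that kind of automation, written here in Python rather than PowerShell. The function name `copy_engine`, the `stockfish_NN` folder naming and the paths in the usage comment are all hypothetical; adjust them to your own layout.

```python
import shutil
from pathlib import Path


def copy_engine(exe: Path, target_root: Path, count: int,
                prefix: str = "stockfish_") -> list[Path]:
    """Copy `exe` into `count` numbered sub-folders of `target_root`.

    Folders are created if missing; an existing copy is overwritten,
    which is what you want when upgrading the engine.
    """
    copies = []
    for i in range(1, count + 1):
        folder = target_root / f"{prefix}{i:02d}"   # e.g. stockfish_01
        folder.mkdir(parents=True, exist_ok=True)
        copies.append(Path(shutil.copy2(exe, folder / exe.name)))
    return copies


# Hypothetical usage (paths are placeholders, not from this thread):
# copy_engine(Path(r"C:\Engines\stockfish.exe"), Path(r"C:\Engines\copies"), 70)
```

Running it again after downloading a new engine build refreshes every folder in one go.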
Parent - By pawnslinger (****) Date 2020-02-12 17:11
I don't know how our setups differ in all details... but of this I am sure.  I have tried the multiplier you mention, and again all it did was slow things down.  How can I tell?  I click on an event and Aquarium will tell me the knps I am getting.  -- when it says 100 or less, that is slow.

Then when I look in the Task Manager of Windows under Details and it shows only 1 entry for Stockfish, I assume that means only 1 copy is in memory.  Maybe I misinterpret what Windows Task Manager is telling me, but there is no mistake when Aquarium tells me something like 100 knps.  That is just slow.  My testing was done about 3 years ago, perhaps in the meanwhile Windows has fixed some holes that it had.  I haven't tried this experiment in a few years.
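
For anyone who wants to double-check the process count outside Task Manager, here is a rough Python sketch that counts rows in the CSV output of the Windows `tasklist` command. The image name `stockfish.exe` is an assumption (use whatever your engine file is called), and the helper names are made up for illustration.

```python
import csv
import io
import subprocess


def count_processes(tasklist_csv: str, image_name: str) -> int:
    """Count rows in `tasklist /FO CSV /NH` output whose image name matches."""
    rows = csv.reader(io.StringIO(tasklist_csv))
    wanted = image_name.lower()
    return sum(1 for row in rows if row and row[0].lower() == wanted)


def running_count(image_name: str) -> int:
    """Ask Windows for the current process list and count matches (Windows only)."""
    out = subprocess.run(
        ["tasklist", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=True,
    ).stdout
    return count_processes(out, image_name)


# Hypothetical usage on a Windows machine while IDeA is running:
# print(running_count("stockfish.exe"))
```

If the count matches the number of engines configured, each engine has its own process regardless of how many copies of the file exist on disk.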

The problem you are having with an event hanging sounds like a bug in Aquarium, perhaps caused by a table overflow someplace, where Aquarium can't handle so many engines at once.  You'd think the program would have been more defensively written, to not allow such a problem.  But clearly Aquarium still has problems... understandable problems, but problems nonetheless.  When folks wrote Aquarium they never thought that one day a 64 core CPU would be used to run it.
Parent - By pawnslinger (****) Date 2020-02-12 17:17
To prove this theory, of too many engines, you could try this... on the 3990X run Aquarium with 14 engines, see what happens.  I know Aquarium can handle that many.  If it works okay, then double it to 28 and in this manner see where it starts to hang.  Just run a few tests to pin down the problem a little.
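
The doubling test can be written down as a small Python sketch. The `runs_ok` callback is hypothetical: in practice it stands for "run IDeA with this many engines for a while and note whether it hangs", which is a manual observation and not something Aquarium reports through an API.

```python
from typing import Callable, Optional


def find_first_failing_count(runs_ok: Callable[[int], bool],
                             start: int = 14,
                             limit: int = 128) -> Optional[int]:
    """Double the engine count from `start` until a run fails or `limit` is passed.

    Returns the first count that failed, or None if every tested count worked.
    """
    count = start
    while count <= limit:
        if not runs_ok(count):
            return count
        count *= 2
    return None
```

With the defaults this tests 14, 28, 56 and 112 engines. Once a failing count is found, the real limit lies somewhere between it and the last count that worked.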

Of course, if it can't handle 14, then something is wrong with the CPU or the system itself.  Aquarium can handle that many, I am sure of it.
- - By centipawn (*) Date 2020-02-12 13:08
Another update - I tried with the remote setup: 128 RTHomeServers on the 3990x, and Aquarium running IDeA on my 16-core machine using the 128 remote Stockfishes.

Not a good idea. It worked for (literally) 4 minutes, then it again began waiting on the last task in a cycle, far beyond the time limit set, even though this task was clearly finished already.

Has anyone ever had Aquarium run successfully with > 64 remote engines configured and not had this problem?

Looks like I will have to follow Pawnslinger's suggestion to use multiple Aquarium instances... until my own tool has progressed enough for my own usage.
Parent - - By Ghengis-Kann (***) Date 2020-02-12 17:14
Hi Centipawn.

Thank you for the detailed update.

120 instances seems crazy to me.
Why not run 12 engines with 10 threads each, or maybe 24 engines with 5 threads each?
You can get the exact same analysis depth and number of positions analyzed by altering the IDeA parameters.

In other words a single 10 thread engine set to 10 seconds per move will do the same thing as 10 individual engines set to 100 seconds per move.
In both cases every 100 seconds you get 10 positions analyzed with 100 (engine*seconds) per position.
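
The arithmetic behind that comparison, as a small Python sketch. It assumes that an engine's threads scale linearly (real engines only approximate this), and the function name `cycle_stats` is just for illustration.

```python
def cycle_stats(engines: int, threads_per_engine: int,
                seconds_per_position: int, window_seconds: int):
    """Return (positions analyzed in the window, thread-seconds per position)."""
    positions = engines * (window_seconds // seconds_per_position)
    effort_per_position = threads_per_engine * seconds_per_position
    return positions, effort_per_position


# One 10-thread engine at 10 seconds per move, over 100 seconds:
one_big = cycle_stats(1, 10, 10, 100)
# Ten 1-thread engines at 100 seconds per move, over 100 seconds:
ten_small = cycle_stats(10, 1, 100, 100)

print(one_big, ten_small)  # (10, 100) (10, 100)
```

Both setups give 10 positions per 100 seconds at 100 thread-seconds each; only the number of engines Aquarium has to schedule differs.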

This vastly reduces the amount of work you have to do configuring remote engines and will be a lot easier for Aquarium's task scheduler to deal with.

You can also use this to balance engines between different computers so every position gets the same quality of analysis.
I gave an example above where a 5 thread engine on a Xeon 2650 is equivalent to a 3 thread engine on a Ryzen 3950X.
Parent - By pawnslinger (****) Date 2020-02-12 17:23
I agree.  This should work for "centipawn".  Cut back the number of instances and increase the thread count per instance.

I, myself, do not do this because I like the control I have with more engines 1 thread per.  But I don't have a 64 core CPU either.
Parent - - By dickie (**) Date 2020-02-12 17:37
Can I suggest you try turning off hyperthreading and run 60 instances, 1 per core. Also try using the pop version of Stockfish rather than bmi2, to see if it makes a difference.
Parent - By pawnslinger (****) Date 2020-02-12 17:44
Yep, I run the popcnt version.  I've always used SMT though... it works pretty good on Zen chips.
Parent - - By Dadi Jonsson (Silver) Date 2020-02-12 18:08

> Has anyone ever had Aquarium run successfully with > 64 remote engines configured and not had this problem?

Aquarium can easily handle more than 64 engines, and the latest versions can deal with much faster analysis than older versions.

You don't give all the information that is needed to analyze the problem you are having, but I'm sure it is easily solved. It certainly has nothing to do with the fact that you are running on 3990x. You need to check things like total memory usage (compared to physical memory), check if the trees are OK etc. There are lots of ideas in older posts about what might be causing this issue.
Parent - By pawnslinger (****) Date 2020-02-12 18:55
I agree... mostly.  But somehow I doubt the "engine hanging" problem is related to a corrupt tree.  On rare occasion I have seen this happen myself, with as few as 6 engines running.  So I don't think it is caused by too many engines running (in my case, anyhow).  When this has cropped up in my work, I find it impossible to cancel the engine, it is not actually running, as Task Manager shows 0 CPU usage.  It is just stuck - and others have reported this happening too.  When confronted with this, I am able to just exit the program and when restarted, everything is fine again.  Rarely does this happen, but when it does, it usually hangs Aquarium for hours, as I don't keep real close tabs on it... I have a life and I don't like to just sit there watching "grass grow".

I have a 900 second max time set usually (600 seconds otherwise), so one might think Aquarium would notice an engine stuck for hours.  Clearly the max time was exceeded.
Parent - - By centipawn (*) Date 2020-02-13 06:03
Hi Dadi,

total memory usage is less than 25% of available RAM. Trees are OK and I can reproduce this problem on a new tree project. The behaviour is exactly as Pawnslinger described it - task finished, but Aquarium keeps waiting for it. If you are interested in debugging the problem, let me know what information you need and I will supply it.

Using fewer engines with more than 1 thread assigned solves the issue. It has now been running stable overnight with 30 engines set at 4 threads each.
Parent - - By pawnslinger (****) Date 2020-02-13 06:43
Here is an interesting article about the 3990x... don't know if it applies in this case or not... but it makes interesting reading:
Parent - - By centipawn (*) Date 2020-02-14 16:30
Thanks Pawnslinger. I have come across similar rumors before (even before I laid hands on the 3990x).

In my tests (outside chess - work related, but based on Win 10 Pro), I could not find any of the Win 10 related problems mentioned there. The CPU is reported by the OS as one socket, 128 threads, and when tasked with appropriate workloads, will run efficiently. We have not encountered any undue overhead when e.g. running parallel workloads in 128 threads. It does, of course, reduce clock speed when under full load, but this is by design and expected. We have done similar tests on Linux, and while the results are a bit better, the difference is by no means significant. This is for our workloads; mileage for different use cases will (definitely) vary.

Since I switched to 30 Stockfish instances with 4 threads each, it is also running stable with Aquarium and happily crunching away. kN/s figures depend on the position of course, but with the CPU running at 100% load, I have mostly seen figures in the range between 4000 and 4500 kN/s. When using just a single Stockfish with 4 threads, the highest I have seen was touching 8000 kN/s. Clock speed under full load does not go lower than 3.5 GHz with 30 instances @ 4 threads each, which is significantly higher than what I expected. CPU temperature goes up to 63 degrees (Celsius) under constant 100% load, which is perfectly fine. All of this is with default and recommended settings - no overclocking of any kind.

Compared to my 16 core PC, which has a SATA SSD, I also noticed that minimaxing and in particular task generation times are much lower on the bigger system, which has an NVMe SSD (and an Evo Pro one at that). So when trees get big and these wait times become a problem, consider upgrading to NVMe if you haven't got it already.
Parent - - By Ghengis-Kann (***) Date 2020-02-14 17:50 Edited 2020-02-14 18:03
Hi Centipawn.

What CPU cooler are you using?
Power supply?

3.5 GHz on all cores is indeed impressive.
I only get 3.7 on all 16 cores of the 3950X with a Noctua NH-15 SE air cooler and MSI Prestige Creation MB.
(Edited because I just remembered that is with the BIOS set with Precision Boost 2 in 90% Eco mode, which gets my power draw at the wall down to about 225 Watts. Running flat out it will do about 4.1 GHz on all cores).

I've been playing a video game called Subnautica that takes a long time to load and will probably cash out some Bitcoin for an NVMe drive to speed things up.
Should I get the Evo Pro? I have had good luck with the EVO 850 SSD drives but started buying the cheaper Mushkin ones lately. Does it matter?
Parent - - By centipawn (*) Date 2020-02-15 10:32
Hi Ghengis-Kann,

the motherboard is a "ROG Strix TRX40-E Gaming", power supply can provide 850 watts, but I have not seen it go over 420. RAM is 32 GB DDR4 3200 CL14-14-14-34 (was meant to be more, but is only 32 GB at the moment as one of the other sticks did not work, waiting for replacement). The cooler is an AIO watercooler, "ASUS ROG STRIX LC 360".

Don't judge me for the flashy gaming stuff - choice of components was mainly due to work circumstances. The motherboard has a solid 16 line power supply to the CPU, which is important because fast changes in load will cause stress to the power supplying components, and any weakness there will result in instability.

The Evo Pros are faster than the cheaper models, but this mainly shows with workloads that either do a lot of random access I/O or have sustained write workloads. Based on the information you supplied, I suspect the long time to load is more an issue of getting all graphics data transferred to the GPU. The transfer happens from the SSD, using the CPU, over PCI, to your GPU - the weakest link will be the limit. I don't know Subnautica and can't speak to how efficiently it stores its data on the SSD, nor do I know which GPU you use, which makes it hard to judge if the disk or your GPU is more likely to be the problem.