I previously used Aquarium in a similar configuration on a 16 core Threadripper with 24 instances, and 16 additional remote instances. This worked fine and was fairly responsive.
On the 3990x, which has 64 cores, I find Aquarium gets fairly unresponsive in the GUI. It works, and will use all configured instances (I tried with 16, 32, 64, 100). But beyond 16, the GUI really lags (seconds), which is weird. I tried upping the priority of the GUI process to 'realtime', but even that did not help. I'm not clear currently what the bottleneck is, but it is certainly bad enough to take the interactivity out of IDeA.
The odd thing is: even when only 4 of the engines are busy, the GUI becomes pretty unresponsive. Something is amiss here... will investigate.
Having said this - this report is just the result of 4 hours of fiddling. Just wanted to put it out there as I imagine some of you may consider upgrading to the 64 core beast.
I had to up my depth of analysis, because the chip was cranking out TOO MANY evals per second.... the backend couldn't keep up and so the GUI slowed down. I would increase my depth of analysis by 25 percent and see if that helps. You can always fine tune for your hardware.
By the way, my backend runs to a fast SSD, so that wasn't the problem. Just in case anyone was wondering. And my system has plenty of RAM at 3200 MHz. In order to keep things smooth, I had to add about 40 percent to my analysis depth (Zen is fast and the 3990X appears to be the fastest).
Firstly, the lack of responsiveness only happens for Aquarium. Even with 120 Stockfishes calculating away, other programs remain fully responsive with no discernible lag.
Secondly, Aquarium has a serious problem in this setup. I repeatedly ended in a situation where an IDeA project would show one or two tasks in progress (currently I have one that it says has been running for over 90 minutes - my setting is a max time of 60 seconds), but does not show any engine busy (nor is there, as far as I can make out), and that is where it never recovers, effectively stopping completely (waiting forever for this task). This type of situation happens sooner with more engines configured - with 120 engines configured, it will happen within 2 hours usually.
I can only make a wild guess at this point: maybe Aquarium has a bug in the threading / thread synchronization (more likely) and/or communication logic with UCI subprocesses (less likely) causing it to occasionally disregard a task that is finished (or should be stopped due to time overrun). Having said this, I am a little surprised that this never showed on a 16-core machine with 24 local and 16 remote engines configured. Odd. Disk or memory speed/size are definitely not the issue here (3200 MHz CL 14, NVMe - Samsung Evo Pro).
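For what it's worth, the failure mode I'm guessing at - a completion notification that gets lost, leaving the scheduler waiting forever on a task that already finished - is a classic threading bug. Here is a small Python sketch of the defensive pattern that avoids it (always wait on a predicate plus a deadline, never on a bare notification). This is purely illustrative; the class and method names are mine, and none of this is actual Aquarium code:

```python
import threading
import time

class TaskTracker:
    """Sketch of a completion-wait pattern that cannot hang forever:
    the waiter re-checks a predicate under the lock and enforces an
    overall deadline, so a missed notification or a lost task only
    costs max_seconds, not an eternity."""

    def __init__(self):
        self.done = set()
        self.cond = threading.Condition()

    def mark_done(self, task_id):
        # Record completion and wake all waiters.
        with self.cond:
            self.done.add(task_id)
            self.cond.notify_all()

    def wait_for(self, task_id, max_seconds):
        # Wait until the task is marked done, or give up at the deadline.
        deadline = time.monotonic() + max_seconds
        with self.cond:
            while task_id not in self.done:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False  # treat the task as lost instead of hanging
                self.cond.wait(remaining)
            return True
```

A scheduler written this way would, at worst, abandon a lost task after the configured maximum time - which is exactly what Aquarium appears not to do here.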
However, at present this makes the 3990x effectively useless for Aquarium. It is lightning fast if you don't mind babysitting it to restart when it hangs, and the UI locking up. If I want to avoid either, I have to dial down the engine setting to such a low number that the 16-core machine is actually faster.
I am running Aquarium just fine on a 3950X, which has exactly the same architecture as far as I know except it only has 2 of the 8 core chiplets instead of 8 that the 3990X has. (edited to add that I am aware of other differences like the number of PCI express lanes but nothing relevant to running Aquarium).
There are different versions of stockfish (popcnt, and some others that I can't immediately remember), my guess is you are running the wrong one.
I also recall from the Komodo readme that there is a particular instruction that is not well executed by the new ryzen chips.
Here it is:
NOTE: On AMD Ryzen CPUs you should not use the BMI2 version of Komodo. The Ryzen implementation of the PEXT instruction used in the BMI2 version is very slow, so please do not use the BMI2 version on those machines.
If it was Stockfish related, then performance would just be bad, but that's no reason why the GUI of Aquarium would become unresponsive.
> It's not that. Stockfish does just fine (and yes, I am using the BMI2 version with the PEXT)
I thought with Zen you should use popcount
> Indeed, and I have switched to that since!
And did that solve the problem?
On another forum someone wrote: "BMI does horrible things to Zen..."
The main problem is still that Aquarium fails to realize a task is done when I configure e.g. 60 engines with 1 thread each.
I am now running 30 engines with 4 threads each. Not ideal, but at least stable.
<joking> Maybe a high BMI is bad for karma... not sure what Zen has to say about that ;-) </joking>
Are you analysing at a fast TC?
It works for a while, but eventually, it will hang because it keeps waiting for a task that did finish. But it is very odd that this does not happen on other CPUs, even when using many more engines. Interestingly, it also happens if I run Aquarium on the 3990x, but all engines are remote. Weirdly, it also happens if I run Aquarium on a different PC, but the remote engines on the 3990x. I work in IT and know quite a bit about multithread / multiprocess / distributed programming etc., but I can't come up with a theory that would explain all these facts.
The first thing I would think of to try and stop this would be to find a way to increase the size of the buffers in question. But hackers know that as long as the buffer is finite, it can be overloaded. So a permanent fix involves changing how the buffer is managed (a task only Microsoft can easily do). When confronted with this sort of problem the fix has been to rewrite the code in such a way that it doesn't use any of the Microsoft buffer handlers. Not very practical in this case, since none of us are Aquarium programmers (at least I don't know of any).
I bet the UCI protocol allows some sort of verbosity control... if it can be accessed from Aquarium is another question entirely.
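As far as I know, the UCI spec itself has no verbosity switch - engines are free to emit "info" lines as often as they like - but those info lines are most of the traffic, so a thin filtering proxy sitting between the GUI and the engine could thin them out. A hypothetical sketch (the function and its parameters are my invention, not anything Aquarium or the UCI protocol exposes):

```python
def filter_uci_output(lines, keep_info_every=0):
    """Drop chatty 'info' lines from a UCI engine's output stream.

    Everything that is not an 'info' line (id, option, readyok,
    bestmove, ...) passes through untouched. If keep_info_every > 0,
    every Nth info line is also kept so the GUI still sees some
    progress updates.
    """
    kept, info_seen = [], 0
    for line in lines:
        if line.startswith("info "):
            info_seen += 1
            if keep_info_every and info_seen % keep_info_every == 0:
                kept.append(line)
        else:
            kept.append(line)
    return kept
```

A real proxy would do this line-by-line on the engine's stdout pipe rather than on a list, but the filtering logic would be the same.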
I wrote a little utility for my own usage that basically stores a chess tree in a database and runs engines, similar in that aspect to what IdEA does (but without GUI and a simpler algorithm for prolongation and alternatives - more a research project than something for actual CC usage at the moment). It can cope with 125 engines on the 3990x and an additional 100 remote engines, no problem. So it is not an intrinsic problem of the platform.
Have you tried clearing out all the cache from every conceivable place in Aquarium - I.A etc etc ?
I'd be very interested to see if this happened with a clean install of windows and a fresh install of Aquarium with no previous trees. If it does there's nothing you can do.
Did you try my suggestion?
Give it a try. With 120 engine instances running, I am sure that you are funneling too much too fast into the Aquarium backend. You either have to cut back the number of engines or (sadly) find some way to slow them down. Ironic, is it not?! We all want faster CPUs, but Aquarium simply cannot handle the quantity being produced.
I don't know what depth you are using, but for the sake of discussion, let's assume you are using ply depth 25... increase it to 38 ply. That should give you much higher quality and much slower speeds.
And that is the benefit of using a 3990X !!! A rather huge leap in Quality.
There is a way to use such a massively parallel CPU without slowing things down (sounds odd to say that). And the way you can do that is to run multiple copies of Aquarium. On my system I can run 5 copies of Aquarium at faster speeds. Ironic? Yes! So you could run 10 Aquariums giving 12 engine instances to each and you might be okay. In this way you are still able to run 120 Stockfish instances, but spread out over 10 Aquarium backends. I mean, I wouldn't do that myself, I would just take my higher quality and go for my GM title.
(edit to insert the following:) What makes me doubt your theory is that even when starting IDeA with my settings, the GUI is terribly slow from the first second - even though the first results will not come in for another 50-odd seconds.
Ultimately, I aim to replace Aquarium with a little piece of custom software I am working on. Far from finished - but I did already run a little test harness with 120 instances of Stockfish working away and logging results (not into a proper DB so far). There's no problem there, it is as fast as you would expect. So it's definitely not a Stockfish problem.
If it is a GUI problem then you can still harness the full power of it by running Aquarium on a different machine.
Not an entirely hypothetical question for me because I plan to go that way in the not too distant future, but will wait for the 4990X to get the cache improvement and the added efficiency from the whole thing being built at 7 nm.
A ply depth of 30 is pretty good, but in your setup, I think it is still too low. If you have 120 instances of Stockfish each producing 1 eval per minute, that is 120 evals per minute, or 7200 per hour, if my math is correct. In my opinion, you want to get that below 1000 per hour. That is the maximum I use anyhow.
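The arithmetic here is easy to fumble, so here it is as a throwaway helper (the function name and the target of 1000/hour are just restating the post above, not anything official):

```python
def results_per_hour(instances, minutes_per_result):
    """How many engine results per hour the backend must absorb,
    given the number of engine instances and the average time each
    one takes to produce a result."""
    return instances * 60 / minutes_per_result

# 120 instances at 1 result/minute each -> 7200 results per hour.
# Slowing each engine to ~8 minutes per result gets under 1000/hour.
```

So with 120 instances, each engine would need to spend roughly 8 minutes per position to stay under the suggested 1000-results-per-hour ceiling.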
In actual usage, I keep it way way under 1000 per hour. I cannot imagine the Aquarium backend handling 7200 per hour. I wish it could.
If you are writing the results of your own Stockfish handler to a flat file... yeah, it can handle 7200 per hour. A flat file is much different than a db backend - which has a lot more overhead involved. I wish I knew how to calculate the exact number of transactions that a Zen CPU can handle. I expect there are lots of variables, like memory speed, memory size, buffer sizes, disc speeds, etc, etc, etc. And I think Ghengis may be wrong about the PCIe not making a difference... if your backend is going out to NVMe, then the speed of your PCIe will definitely figure into the formula. And of course, then there is the coding in Aquarium itself, which I suspect has not been developed with hugely parallel CPUs in mind. Nor with very fast CPUs in mind either.
In my opinion, the development of Aquarium (which is no longer being done) targeted the least common denominator CPU - think a four-core 3rd-generation Intel i7 - and I think you would be close to the mark. Zen with 64 cores is way way beyond anything that Aquarium was developed to handle.
In any case chess analysis does not appear to be IO bound, but rather a direct function of clock speed and instructions per clock.
My Xeon 2650 server is i5 level technology with 16 cores, 2.6 GHz clock speed, DDR3 1600 and PCIe3. The 3950X runs 3.7 GHz on all 16 cores, has DDR4 3200 and PCIe4.
My engine balancing experiments determined that 5 threads of the Xeon equal 3 threads of the Ryzen, which is fully accounted for by the higher clock speed and improvements in instructions per clock. The much faster memory and bus speeds are contributing essentially nothing.
I think that a more efficient CPU can bottleneck lots of different things, including the PCIe bus. What happens with lower CPUs (like mine) does not always translate well to higher end CPUs.
On my cpu (1700X) I cannot bottleneck my PCIe bus. So you are right. On some other setup with a faster cpu, and faster NVMe, a slow PCIe bus could possibly bottleneck. Especially depending on the other components in the system. It is very hard to make a generalization, when I don't have his system to test it on.
It's like driving a donkey cart down an 8 lane highway.
But in my system, Chess analysis is seldom all that is going on. I don't even begin to know how to test whether the PCIe bus is bottlenecked or not. I do know that the general view is that modern video games, even using high-end GPUs, do not bottleneck it. And they use a lot more bandwidth than Chess analysis.
However, there must be a reason why they developed PCIe 4.0!? If PCIe is not a bottleneck, why jump to 4. Oh yeah, I guess it must be because more and more things are hanging off that bus everyday. That could be it.
So you could be right. But I think it must be a case by case analysis. Depends on the system and what else might be going on in it. If just Aquarium and Stockfish, I am positive you are right. But it doesn't happen that way in my system. For starters, I run a video server on this machine, Serviio... it is almost constantly playing videos for someone in my home. That runs in the background 24/7. Then there is whatever I am doing personally (this is my workstation), the Aquarium analysis just hangs around in the background most of the time... I play videos and normal games in the foreground (I am retired, lots of time on my hands). And this browser has 3 tabs open right now, usually YouTube and news type sites. And all runs smoothly even if Aquarium is bogged down... so I am sure that the PCIe bus on my system is more than adequate.
So yes, you are probably correct... in my case.... and in yours. But I haven't a clue about "centipawn" and his system (except that it is a really fast system).
You know a funny thing (I like to say that <grin>), I started having these type of GUI issues with Chessbase a couple of months ago. I Google'd around and found that I was not the only one with such CB issues. I started having the problems after installing Eman. Turns out I had to tweak a few things in CB to get it back on the smooth track. So Aquarium is not alone with this issue.
Ghengis: Does this imply that my dual Xeon system with 2 X 18= 36 cores would be equivalent (on a Ryzen 3900 series) to 3 X 36/5 cores or 21.6 Ryzen cores, in which case a Ryzen with more than 22 cores would be faster for chess than such a dual-Xeon system?
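Spelling out that conversion (the 5:3 thread equivalence is Ghengis's measurement from earlier in the thread; the helper itself is just arithmetic):

```python
def ryzen_equivalent_threads(xeon_threads):
    """Convert Xeon 2650 threads into equivalent Ryzen 3950X threads,
    using the empirically measured ratio of 5 Xeon threads == 3 Ryzen
    threads reported above."""
    return xeon_threads * 3 / 5

# Dual Xeon, 2 x 18 = 36 cores -> about 21.6 Ryzen-core equivalents,
# so a Ryzen with 22+ cores would out-analyse that dual-Xeon box.
```

Whether the 5:3 ratio carries over unchanged from the 3950X to the Threadripper parts is an open question, of course.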
" will wait for the 4990X to get the cache improvement and the added efficiency from the whole thing being built at 7 nm."
I can't find any info on TR 4990X. What have you heard in terms of cores, etc?
According to my rough calculation (very rough) I think you should try to raise your depth from 30 to 45 ply. That will slow it down enough, I think.
You can always do fine tuning.
When doing infinite analysis, I see on Ipman chess that all the speedsters with great TR 3900-series results use threads = 2 X no. of physical cores (and SF popcount). However, that is for infinite analysis.
I was under the impression that when using Aquarium/IDeA, one should not use threads but only physical cores, e.g., on my dual-Xeon master system with 2 X 18 = 36 cores + another 20 cores available on LAN, I can comfortably use 55 of the 56 cores (at least overnight) in IDeA without any problems, leaving only one core free on the master system for Aqr/Win10. Why should Ryzen TR chips be any different than a Xeon? Perhaps Aqr/IDeA is happy only with physical cores but not with extra threads?
Regarding "use higher depth or Aquarium can't cope": No, does not help. I tried it over night with 120 Stockfish instances, minimum time 600, maximum time 900, depth 42, which should - due to minimum time 10 minutes - ensure that no more than 1 result every 5 seconds comes in (on average). It got stuck fairly quickly, again waiting for the last task in the queue to finish - except that was not running anymore. After that, no progress whatsoever happens until I kill it. Note that on my 16 core machine, I had about 1 result every 3-4 seconds come in with no problem.
I also tried activating 9 projects, with 112 Stockfishes configured in 7 blocks of 16 each (just to see if that makes a difference). Just after activating IDeA, the GUI freezes, and it took 7 minutes for it to react to a button press (on Stop IDeA).
Remote engines - yes, that crossed my mind too. Main reason I have not tried it yet is that I have to figure out first how to automate setup, as manually configuring 120 RtHomeServers & 120 remote UCIs in Aquarium is about as much fun as having your tongue stapled to the floor.
Re: Database speed, bottlenecks, PCI throughput etc. I have quite some experience with all that (more than 20 years in IT now), and I share the opinion that Aquarium was never built for this kind of usage (how could it have been, back then - it clearly has not been modernized thoroughly in recent years). What we are observing is very far from the actual limits hardware and software impose these days. I have seen several databases holding billions of records and taking in tens to hundreds of thousands of records every hour (financial market tick information is one example). Desktop installations of SQL DBs such as PostgreSQL, some NoSQL DBs such as Mongo, and certainly specialized databases (many exist for fast logging) can easily consume hundreds to thousands of records per second with little sweat if configured and used wisely. PCI lane numbers and speed are of little concern here. The main reason why PCI 4 exists is because when modern video games start up, they need to transfer a great deal of data (GBs, typically) to the GPU's memory, which causes a delay in game startup. This is where these transfer speeds matter. For chess - even for LC0 - they are irrelevant. In fact, for chess, CPU is the only thing that matters. I did wonder if, with 120+ engines running, memory bandwidth for hashing would become an issue. Based on my measurements, it is not - I see 100% CPU utilization. If memory were throttling computations, this figure would be lower.
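To put a rough number on "hundreds to thousands of records per second with little sweat": even SQLite, about the least server-grade option there is, will sustain that on desktop hardware, provided inserts are batched into transactions rather than committed one by one. A small self-contained benchmark sketch (the table layout and row contents are made up purely for illustration):

```python
import sqlite3
import time

def ingest_rate(n_rows=50_000):
    """Insert n_rows (position, move, score) records into an indexed
    SQLite table inside a single transaction, and return the achieved
    rows-per-second rate."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE evals (pos TEXT, move TEXT, score INT)")
    db.execute("CREATE INDEX idx_pos ON evals (pos)")
    rows = [(f"pos{i}", "e2e4", i % 200 - 100) for i in range(n_rows)]
    t0 = time.perf_counter()
    with db:  # one transaction around the batch: the key to high ingest rates
        db.executemany("INSERT INTO evals VALUES (?, ?, ?)", rows)
    return n_rows / (time.perf_counter() - t0)
```

An on-disk database with durable commits would be slower than this in-memory run, but batching keeps it comfortably in the thousands-per-second range either way - orders of magnitude beyond the handful of results per second that trips up Aquarium.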
To elaborate further on the DB issue. With an Aquarium style use case, the challenge is not really taking in the data as fast as an engine can produce it. The challenge is how to organize that data so that minimaxing is efficient, threefold repetitions in move sequences are easy to detect, and (ideally) the 50-move draw rule can be observed, all the while ensuring that transpositions don't result in duplicated work. There are several ways to tackle this, but none are trivial. Regardless, I think with modern hardware and software it will certainly be possible to handle at least tens of results coming in per second (I think the hard limit is much higher). Again - not a criticism of Aquarium, which was designed when the kind of parallelism we have today did not exist.
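As a tiny illustration of the transposition point: if the tree is keyed by position rather than by move path, a transposed line simply lands on an existing node and costs no duplicate analysis, and minimaxing walks the same structure. A toy sketch (my own invention - this has nothing to do with Aquarium's actual storage format, and a real implementation would also track repetition and 50-move counters as discussed above):

```python
class TreeStore:
    """Toy IDeA-style analysis tree keyed by position, so that
    transpositions map onto one node and are analysed only once."""

    def __init__(self):
        # position key -> {"eval": centipawns, "children": {move: child key}}
        self.nodes = {}

    def add(self, key, evaluation):
        """Add a position; return False if it is a known transposition."""
        if key in self.nodes:
            return False  # already analysed via another move order
        self.nodes[key] = {"eval": evaluation, "children": {}}
        return True

    def link(self, parent_key, move, child_key):
        """Record that `move` leads from parent to child."""
        self.nodes[parent_key]["children"][move] = child_key

    def minimax(self, key, maximize=True):
        """Back up leaf evaluations through the tree."""
        node = self.nodes[key]
        children = node["children"]
        if not children:
            return node["eval"]
        scores = [self.minimax(k, not maximize) for k in children.values()]
        return max(scores) if maximize else min(scores)
```

In practice the key would be something like a Zobrist hash or a normalized FEN; the point is only that lookup-by-position makes transposition detection a dictionary hit instead of duplicated engine work.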
Whether to increase the depth for the engine, or to have shallow depth but a much deeper tree, is probably a matter of opinion (and I have not settled mine yet). Personally, in my own software, I want to go for a model that uses at least two different engines and adaptive depth. Basically, one of the rules will be: if all different engines agree fairly quickly that the move is bad, accept that it is. If, say, Stockfish says it is +0.9, and Komodo says it is -0.1, then keep increasing tree depth until it is clear which one is wrong. Evaluation depth will also be adaptive. My main gripe with just increasing Stockfish depth is that it will sometimes not look at moves that I want it to look at (often the type of moves that make a game interesting and turn out not to be bad at all - and usually those I add manually in interactive analysis. 99% of the time the engines will tell me I made a blunder, but the remaining 1% have turned out to be moves that can be decisive).
However - to put this in proportion: I am by no means a chess expert. I suck OTB as I will blunder frequently, and have only started playing correspondence fairly recently. As far as software/hardware is concerned, I usually know what I am talking about, but as far as chess is concerned, take my thoughts with a grain of salt (or three).
I know when I started running multiple copies of Aquarium, I had to setup separate folders for each Aquarium instances' engines. When I tried running off 1 engine folder, things were very sluggish. So in my setup I have:
Repeat N times for B, C, D, E etc....
So I have N copies of Stockfish, Eman, etc. N copies of Aquarium, etc. This is one reason I do not use engines with copy-protection, I ran into lots of trouble with them. Especially Houdini... it did not like me running 5 copies of it AT ALL.
What I did was this: I set up 7 local UCI engines all pointing to the same Stockfish executable. I configured all 7 for IDeA usage, each with a multiplicity factor of 16. I did this to see if it would make a difference - Aquarium juggling 16 instances each of 7 "separate" engines, or juggling 112 engines of one and the same. It doesn't. It was just a sanity check - it could conceivably have made a difference if synchronization in Aquarium was bound to the configured UCI engine(s). I'm not too surprised that this is not the case.
What I did different yesterday was that I configured seven UCI Engines for Stockfish in the "Engines" view, and in the "Engines" dialog of IDeA view, I added them all, each with a multiplier of 16.
All these UCI engines pointed to the same executable, all in the same location. I have never had problems with that for Stockfish.
On my 16 core machine, I usually have 24 engines configured for IDeA, but I have two more Stockfish(es?) set up as UCI engines for engine matches - also pointing to the same executable, in the same disk location. I am able to run engine matches while IDeA is working away with this setup, and it has never caused a problem.
This may not be true for other engines. It certainly won't work for engines that use their storage location to write data to (persistent hashes or similar), or that lock any files they read from there.
Windows will still start a separate process for each Stockfish instance Aquarium fires up, i.e. there will be separate copies in RAM, each with their own working data, regardless of whether they come from the same disk location or not (this is a slight simplification - for any DLLs referenced by the EXE, Windows will only keep one copy of executable code in memory, but any working data will be duplicated and separate for each process, so this does not matter in practice). I'm not clear why you encountered performance differences when branching out into folders - it sounds odd to me.
Incidentally, PowerShell can help you automate things like "copy this exe to those 70 folders". I use it too seldom to know off hand how to do that - have to google it every time, so I resort to hand copying for anything < 20 regardless...
Then when I look in the Task Manager of Windows under Details and it shows only 1 entry for Stockfish, I assume that means only 1 copy is in memory. Maybe I misinterpret what Windows Task Manager is telling me, but there is no mistake when Aquarium tells me something like 100 knps. That is just slow. My testing was done about 3 years ago, perhaps in the meanwhile Windows has fixed some holes that it had. I haven't tried this experiment in a few years.
The problem you are having with an event hanging sounds like a bug in Aquarium, perhaps caused by a table overflow someplace. Where Aquarium can't handle so many engines at once. You'd think the program would have been more defensively written, to not allow such a problem. But clearly Aquarium still has problems... understandable problems, but problems none the less. When folks wrote Aquarium they never thought that one day a 64 core CPU would be used to run it.
Of course, if it can't handle 14, then something is wrong with the CPU or the system itself. Aquarium can handle that many, I am sure of it.
Not a good idea. It worked for (literally) 4 minutes, then it again began waiting on the last task in a cycle, far beyond the time limit set, even though this task was clearly finished already.
Has anyone ever had Aquarium run successfully with > 64 remote engines configured and not had this problem?
Looks like I will have to follow Pawnslinger's suggestion to use multiple Aquarium instances... until my own tool has progressed enough for my own usage.
Thank you for the detailed update.
120 instances seems crazy to me.
Why not run 12 engines with 10 threads each, or maybe 24 engines with 5 threads each?
You can get the exact same analysis depth and number of positions analyzed by altering the IDEA parameters.
In other words a single 10 thread engine set to 10 seconds per move will do the same thing as 10 individual engines set to 100 seconds per move.
In both cases every 100 seconds you get 10 positions analyzed with 100 (engine*seconds) per position.
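That equivalence is easy to sanity-check with a couple of lines of Python (the function name and return convention are mine):

```python
def idea_throughput(engines, threads, seconds_per_move):
    """For a given IDeA engine setup, return (positions analysed per
    hour, engine-thread-seconds of effort spent per position)."""
    positions_per_hour = engines * 3600 / seconds_per_move
    effort_per_position = threads * seconds_per_move
    return positions_per_hour, effort_per_position

# Ten 1-thread engines at 100 s/move and one 10-thread engine at
# 10 s/move both give 360 positions/hour at 100 thread-seconds each.
```

The caveat, which the thread touches on elsewhere, is that one engine searching with 10 threads is not quite as efficient as 10 independent single-thread searches of equal total effort - SMP scaling in chess engines is sublinear - but as a scheduling-load equivalence the arithmetic holds.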
This vastly reduces the amount of work you have to do configuring remote engines and will be a lot easier for Aquarium's task scheduler to deal with.
You can also use this to balance engines between different computers so every position gets the same quality of analysis.
I gave an example above where a 5 thread engine on a Xeon 2650 is equivalent to a 3 thread engine on a Ryzen 3950X.
I, myself, do not do this because I like the control I have with more engines at 1 thread each. But I don't have a 64 core CPU either.
> Has anyone ever had Aquarium run successfully with > 64 remote engines configured and not had this problem?
Aquarium can easily handle more than 64 engines, and the latest versions can deal with much faster analysis than older versions.
You don't give all the information that is needed to analyze the problem you are having, but I'm sure it is easily solved. It certainly has nothing to do with the fact that you are running on 3990x. You need to check things like total memory usage (compared to physical memory), check if the trees are OK etc. There are lots of ideas in older posts about what might be causing this issue.
I have a 900 second max time set usually (600 seconds otherwise), so one might think Aquarium would notice an engine stuck for hours. Clearly the max time was exceeded.
Total memory usage is less than 25% of available RAM. Trees are OK and I can reproduce this problem on a new tree project. The behaviour is exactly as Pawnslinger described it - task finished, but Aquarium keeps waiting for it. If you are interested in debugging the problem, let me know what information you need and I will supply it.
Using fewer engines with more than 1 thread assigned solves the issue. It has been running stable overnight with 30 engines set at 4 threads each now.
In my tests (outside chess - work related, but based on Win 10 Pro), I could not find any of the Win 10 related problems mentioned there. The CPU is reported by the OS as one socket, 128 threads, and when tasked with appropriate workloads, will run efficiently. We have not encountered any undue overhead when e.g. running parallel workloads in 128 threads. It does, of course, reduce clock speed when under full load, but this is by design and expected. We have done similar tests on Linux, and while the results are a bit better, it is by no means significant. This is for our workloads, mileage for different use cases will (definitely) vary.
Since I switched to 30 Stockfish instances with 4 threads each, it is also running stable with Aquarium and happily crunching away. kN/s figures depend on the position of course, but with the CPU running at 100% load, I have mostly seen figures in the range between 4000 and 4500 kN/s. When using just a single Stockfish with 4 threads, the highest I have seen was touching 8000 kN/s. Clock speed under full load does not go lower than 3.5 GHz with 30 instances @ 4 threads each, which is significantly higher than what I expected. CPU temperature goes up to 63 degrees (Celsius) under constant 100% load, which is perfectly fine. All of this is with default and recommended settings - no overclocking of any kind.
Compared to my 16 core PC, which has a SATA SSD, I also noticed that minimaxing and in particular task generation times are much lower on the bigger system, which has an NVMe SSD (and an Evo Pro one at that). So when trees get big and these wait times become a problem, consider upgrading to NVMe if you haven't got it already.
What CPU cooler are you using?
3.5 GHz on all cores is indeed impressive.
I only get 3.7 on all 16 cores of the 3950X with a Noctua NH-D15 SE air cooler and MSI Prestige Creation MB.
(Edited because I just remembered that is with the BIOS set with Precision Boost 2 in 90% Eco mode, which gets my power draw at the wall down to about 225 Watts. Running flat out it will do about 4.1 GHz on all cores).
I've been playing a video game called Subnautica that takes a long time to load and will probably cash out some Bitcoin for an NVMe drive to speed things up.
Should I get the Evo Pro? I have had good luck with the EVO 850 SSD drives but started buying the cheaper Mushkin ones lately. Does it matter?
The motherboard is a "ROG Strix TRX40-E Gaming", power supply can provide 850 watts, but I have not seen it go over 420. RAM is 32 GB DDR4 3200 CL14-14-14-34 (was meant to be more, but is only 32 GB at the moment as one of the other sticks did not work, waiting for replacement). The cooler is an AIO watercooler, "ASUS ROG STRIX LC 360".
Don't judge me for the flashy gaming stuff - the choice of components was mainly due to work circumstances. The motherboard has a solid 16-phase power supply to the CPU, which is important because fast changes in load will cause stress to the power supplying components, and any weakness there will result in instability.
The Evo Pros are faster than the cheaper models, but this mainly shows with workloads that either do a lot of random access I/O or have sustained write workloads. Based on the information you supplied, I suspect the long time to load is more an issue of getting all graphics data transferred to the GPU. The transfer happens from the SSD, using the CPU, over PCI, to your GPU - the weakest link will be the limit. I don't know Subnautica and can't speak to how efficient it stores it data on the SSD, nor do I know which GPU you use, which makes it hard to judge if the disk or your GPU is more likely to be the problem.