Rybka Chess Community Forum
The Rybka Lounge / Computer Chess / PC Configuration
- - By Legatoalfuturo (**) [it] Date 2008-04-22 20:11

I am planning to get a new PC, and I would like to know whether this configuration is powerful enough to run games and analysis quickly with the Rybka 2.3.2a 64-bit engine in the Shredder 10 GUI, on Vista 64-bit naturally.

CPU Intel Quad Q6600 (or Q9300 ??)
RAM 4 GB (2x2 GB)
Video Card ZOTAC (or XFX) GeForce 8800GT 512 MB
HDD - 2x Western Digital 250 GB, 16 MB cache - RAID 0


Parent - - By Legatoalfuturo (**) [it] Date 2008-04-23 16:05
Is anyone able to give me some advice, please?
Parent - - By Legatoalfuturo (**) [it] Date 2008-04-23 19:23
Thank you for your advice Phil ;-)

Frankly speaking, I was astonished to read that it is far better to put tablebases (I am thinking of the many huge Nalimov files ;-)... I only have those up to 5-men, 7.05 GB) on a USB pen drive in order to get the fastest access times!! I had guessed that a RAID hard disk setup would be faster than a USB 2.0 connection. Very, very interesting!!
So, in your opinion, I could just get a single 500 GB hard disk with 16 MB cache for normal use and that's it....

As far as the CPU is concerned, since I would like to get a PC with good longevity....I would be inclined to consider the Q9550 for my system. The 45nm technology of this processor should fit perfectly with, and be fully exploited by, the Asus P5K-E motherboard, shouldn't it? Do you believe that 4 GB (2x2 GB) will be enough on Vista 64-bit under heavy use? (I am mainly considering the calculation involved in chess analysis, but also demanding recent games.....)

As regards the cooling part of the "project", what about a case from Lian Li, maybe one from the Silent range? (Maybe the a12.....ATX midtower). Would it be enough to keep temperatures down under moderate overclocking while staying quiet? I know they are somewhat expensive, but they seem to be the state of the art in silent and well-cooled cases.

Many issues, my kind friend....:-)

May I ask you one thing about Rybka2.3.2a?
In your opinion, which is the strongest opening book for this engine? I downloaded Perfect13.ctg and imported it into the Shredder 10 GUI, but I have not seen much of an improvement compared to RybkaII.ctg. Am I getting something wrong in the configuration or anywhere else?

Sorry for such a long post.

Parent - - By Phil Harris (***) [gb] Date 2008-04-23 23:06
My understanding with TBs is that, given the small amount of data that has to pass between storage and memory for each hit, access time is far more important than bandwidth. So while you will get very high transfer rates for large files with HDs in a RAID 0 setup, flash memory will have faster access times. Flash drives are also much cheaper and can tolerate the nature of the task much better than a mechanical device.
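
A rough sketch of why access time dominates for small reads. The latency, bandwidth and block-size figures below are assumed round numbers for illustration only, not measurements of any particular drive, and `probe_time_ms` is a made-up helper name.

```python
def probe_time_ms(access_ms: float, bytes_read: int, bandwidth_mb_s: float) -> float:
    """Time for one small random read: fixed access latency plus transfer time."""
    transfer_ms = bytes_read / (bandwidth_mb_s * 1_000_000) * 1000
    return access_ms + transfer_ms


# Assumed, illustrative figures (not measured):
BLOCK = 8 * 1024  # bytes fetched per tablebase hit

# Mechanical RAID 0: high bandwidth, but every probe pays a seek.
raid0_hdd = probe_time_ms(access_ms=12.0, bytes_read=BLOCK, bandwidth_mb_s=150.0)

# USB flash stick: low bandwidth, but no moving parts, so access is quick.
usb_flash = probe_time_ms(access_ms=1.0, bytes_read=BLOCK, bandwidth_mb_s=25.0)

print(f"RAID 0 HDD: {raid0_hdd:.2f} ms per probe")
print(f"USB flash:  {usb_flash:.2f} ms per probe")
```

With reads this small, the transfer term is a fraction of a millisecond either way, so whichever device has the lower access latency wins regardless of its headline transfer rate.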

I am not sure I like the choice of case too much. Lian Li make wonderful stuff, but avoid anything that has low noise as a design feature rather than very good air circulation. I do like their V2000 series very much. I would also have a look at the Antec 900 for a very good value well ventilated case.

I think the Q9550 should be the way to go, though I would first check that the P5K-E will support the 8.5 multiplier. If you were restricted to 8.0x, you would need higher FSB speeds than might be comfortable. I have had very good experience with the X38 boards from Gigabyte, and they do support 8.5 multipliers.

Books are something of a black art, and I am not the best person to ask. I found HS for Rybka to be a very solid performer. In general, books composed to complement the playing style of a specific engine will always be best, which probably explains why RybkaII is still a good performer despite its age.
I created my own book from computer games downloaded from various sites, then added in my own game history. It plays OK, but probably no better than the books you can download.
Parent - - By Eelco de Groot (***) Date 2008-05-08 17:42

I don't have one of these beauties but for playing tournaments or serious Blitz with a program that uses tablebases or bitbases I would like to have one!

I have a question. Maybe I could find the answer somewhere on the forum, but can anyone tell me whether, with a 64-bit operating system (XP or Vista 64-bit, for instance), there are problems using more than 3 gigabytes of main memory? I keep reading that Vista normally only recognizes the first 3 gigabytes, and most desktop systems seem to have that as a maximum? Thanks for any advice!

Parent - By Mark (****) [us] Date 2008-05-08 17:53
I just got an HP Q6600 with Vista 64 and it came with 4 GB ram standard.
Parent - - By Vempele (Silver) [fi] Date 2008-05-08 18:15
Most of the point of 64-bit is that you can have lots of memory. I think Vista supports up to 128 GB.
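
The roughly 3 GB ceiling asked about above comes from 32-bit addressing rather than from Vista as such. A minimal sketch of the arithmetic (`addressable_gib` is just an illustrative helper name):

```python
def addressable_gib(address_bits: int) -> float:
    """Size of the address space, in GiB, reachable with the given pointer width."""
    return 2 ** address_bits / 2 ** 30


print(addressable_gib(32))  # total address space with 32-bit pointers
print(addressable_gib(64))  # theoretical address space with 64-bit pointers
```

On a 32-bit system, part of that 4 GiB range is typically reserved for devices such as the video card, which is why about 3 GB of RAM is what usually shows up as usable. A 64-bit OS does not have that constraint; the practical limit is then set by the OS edition and the motherboard.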
Parent - By Eelco de Groot (***) Date 2008-05-08 19:07
Thanks guys, that is good to know. Strange that there are still so many systems sold with 32-bit Vista; how difficult can it be for makers of old hardware to provide new drivers that are at least functionally compatible? I suppose testing 64-bit drivers in real situations is the more expensive and time-consuming part of it.

Parent - - By Phil Harris (***) [gb] Date 2008-04-23 16:51
That setup would cover just about everything that you need. The video card is a good performer: far more than you need for chess, of course, but very good for games.

Doing your RAM as 2x2 GB is also a good plan. I am not sure I would bother with a RAID setup for the HDs; the speed is not really a great benefit for gaming or chess. If you are thinking of tablebases, the best way is to put them on a USB stick, which speeds them up nicely, and these are getting cheaper almost daily.

As far as the CPU is concerned, if the budget allows, I would consider a Q9550, as it will run cooler and overclock higher for a given temperature than a Q6600. It also has a larger L2 cache, which should provide a slight performance benefit.

I am assuming you will want to overclock this at least a little. Money spent on good quality cooling will help enormously. A Q6600 G0 should be easily capable of running at 3.2 GHz with good air cooling; with the Q9550, about 3.4 GHz should be simple to achieve and perfectly stable on air. This is provided it's in a case with good air circulation, of course.
Parent - - By Legatoalfuturo (**) [it] Date 2008-04-23 19:26
For some reason I made a mistake when replying to your message, and the forum placed my reply before your message itself. Sorry for the confusion :-)
Parent - - By M ANSARI (*****) [kw] Date 2008-04-23 20:12
Phil ... how is the 9550 clocking?  I haven't tried it yet but it should in theory be better than a G0 due to 45nm and more cache...  Also is it compatible with the older P35 motherboards?
Parent - - By Phil Harris (***) [gb] Date 2008-04-23 22:28
Hi Majd,

As far as I have seen so far, it overclocks well, the only issue being a maximum multiplier of 8.5. That also raises a possible issue with P35 motherboards. I know the P5K-E does support 45nm CPUs, but I am not sure it will allow you to access the 8.5 setting.

I have two QX9650s running on my X38 and P35 Gigabyte boards, and while the .5 option appeared after a BIOS update on the X38, it's not yet available on the P35. So it may be a P35 limitation, or just Gigabyte being sluggish with updates. I shall try to find out.

I have seen a few at 4+ GHz, though obviously with pretty high FSBs for that, but Vcore requirements, as you would expect, are in line with the QX9650. I like the idea of running one at 8.5x440 for a nice stable 3.74 GHz at hopefully 1.38 to 1.40 Vcore, which should be a very good 24/7 overclock that is not too hard to cool.
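
The clock arithmetic here is simply multiplier times FSB base clock. A small sketch (the helper name is made up; the stock base clock is derived from the rated 1333 MHz FSB, which is quad-pumped, a detail not stated in the thread):

```python
def core_clock_mhz(multiplier: float, fsb_mhz: float) -> float:
    """CPU core clock in MHz for a given multiplier and FSB base clock."""
    return multiplier * fsb_mhz


# Stock Q9550: 8.5x on a 1333 MHz rated (quad-pumped) FSB, i.e. ~333 MHz base clock.
stock = core_clock_mhz(8.5, 1333 / 4)

# The overclock described above: 8.5x on a 440 MHz base clock.
overclocked = core_clock_mhz(8.5, 440)

print(f"stock:       {stock / 1000:.2f} GHz")
print(f"overclocked: {overclocked / 1000:.2f} GHz")
```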


Parent - - By M ANSARI (*****) [kw] Date 2008-04-24 08:59
That would seem good ... at 3.74 GHz that would probably equal the performance of a 3.9 or even 4 GHz Q6600 due to the extra cache.  I haven't seen this processor on Newegg, so it must be pretty new.  I see it listed at 2.83 GHz ... so that should make it more expensive than the Q6600 by quite a bit.  There must be a Q9450 or some equivalent to the Q6600.  The 1333 MHz FSB seems to make it a little difficult to overclock, or at least to get good gains ... so it looks like the Q6600 is still the better bet for bang for the buck ... hard to believe, since it is about a year old already.
Parent - - By Banned for Life (Gold) Date 2008-04-25 13:53
The extra cache shouldn't be much of an advantage for Rybka since the primary memory usage is related to the hash table which is designed to minimize locality (i.e. the chance of getting a cache hit). In fact, I suspect that Vas prevents storing hash values in the L2 cache entirely.

To the extent that a 45nm quad setup is better than a 65nm quad setup at the same clock speed and FSB, the improvement in nodes/clock is most likely attributable to other minor changes and corrections that Intel made when they did the shrink (and for dual quad core setups, the major improvement is probably in the chipset rather than the CPUs).

Has anyone measured Rybka 2.3.2a kn/s for 45nm and 65nm quad setups with the same clock speed, FSB, and chipset to quantify this difference?

Parent - - By Lukas Cimiotti (Bronze) [de] Date 2008-04-25 15:52
I did. The extra cache helps Rybka very well - speed increase is roughly 6%.
Zappa doesn't care about that extra cache - speedup is less than 2%.

Parent - - By Banned for Life (Gold) Date 2008-04-25 18:53
The cache is only one of the changes that was made during the 45nm Penryn shrink. I don't think it's the cache that's responsible for Rybka's clock-for-clock improvement. I would like to hear from Vas if he is bypassing the L2 cache when accessing the hash (I believe he is).

Parent - - By Lukas Cimiotti (Bronze) [de] Date 2008-04-25 19:35
And why do you think Zappa doesn't get faster on the new 45 nm processors?
There are no really important improvements besides bigger cache.
Parent - - By Banned for Life (Gold) Date 2008-04-25 21:28
Wrong. Streaming instructions (excluding SSE4) are anywhere from 0.5% to 13.4% faster in Yorkfield processors than in Kentsfield processors running at the same clock and FSB speed. This improvement is due to micro-architecture improvements that reduce many latencies by 1 clock cycle and has nothing to do with cache size.

A 12MB L2 cache is just as useless as an 8MB L2 cache if you're silly enough to try to use it for fast access to a 2GB transposition table. Any hash cluster that went into L2 would be gone before it was needed again. I would be very surprised if Vas was using it for this, and the small difference in hit rates for instructions and other tables will be hardly measurable.

Rybka would be expected to get more of an improvement than Zappa because it is more sensitive to latencies than Zappa. Period.
Parent - - By Lukas Cimiotti (Bronze) [de] Date 2008-04-26 06:14
Maybe you remember AMD Venice and San Diego - same design, only difference cache size.
At the same speed, Rybka performed 5% better on San Diego.
Parent - - By Banned for Life (Gold) Date 2008-04-26 07:10
For the Venice and San Diego, your statement would have been correct. The relative difference between the 512 kB and 1 MB L2 caches on the Venice and San Diego likely caused an increase in Rybka instruction and non-hash table cache misses.

Your statement that the only difference between the 65nm and 45nm processors running at the same clock and FSB rate is the cache size is demonstrably wrong. The most likely reason for better Rybka performance is probably shown in the table below.

I doubt there would be even a 1% difference due to higher instruction and non hash table memory accesses going from 8 MB to 12 MB.

Attachment: shuffle.jpg (50k)
Parent - - By Lukas Cimiotti (Bronze) [de] Date 2008-04-26 07:51
I didn't say the only difference is cache size. I said: There are no really important improvements besides bigger cache.
That's something different, I think.
I don't think Rybka makes heavy use of SSE instructions, but I don't know exactly. There is one fact that makes me assume this: Rybka performs badly on P4 although P4's SSE performance isn't all that bad.

So one way to find out if Rybka likes bigger caches would be to get a Core 2 Quad 9300 and compare it to my (then) underclocked QX9650.
Or maybe we should ask Vas if he uses SSE at all.
Parent - - By Vempele (Silver) [fi] Date 2008-04-26 07:55

> There is one fact that makes me assume this: Rybka performs badly on P4 although P4's SSE performance isn't all that bad.

Chess engines perform badly on P4, period.
Parent - By Lukas Cimiotti (Bronze) [de] Date 2008-04-26 08:02
Yes. That's because chess programs mainly consist of normal integer math - not of SSE instructions. :)
Parent - - By Banned for Life (Gold) Date 2008-04-26 08:06
I suspect that Vas does make use of streaming instructions, but the improved shuffle engine is not relegated to performing only streaming instructions. It deals with all data alignment issues, and all chess engines do a lot of data alignment.

My last P4 ran at 700 MHz before I switched to AMD. The P4 was designed with very long pipelines to allow high clock rates. These long pipelines lead to very high latency and major penalties for things like branch misprediction. Rybka runs poorly on systems with high latency, so it's not surprising that its performance on the P4 is really atrocious.

Parent - - By Lukas Cimiotti (Bronze) [de] Date 2008-04-26 08:22
There was no P4 with a clockspeed of 700 MHz, the lowest speed was a Willamette at 1.3 GHz.

And why does Zappa not get faster on the new 45 nm chips? Or does Zappa not use data alignment? Like all chess engines?
Parent - - By Banned for Life (Gold) Date 2008-04-26 08:48 Edited 2008-04-26 09:03
OK, in that case the Dell I bought in April 2000 had a 700 MHz PIII in it. But I don't feel so bad about my memory when I see you having a hard time remembering what you wrote a few hours ago:

> Zappa doesn't care about that extra cache - speedup is less than 2%.

But I guess what you're really saying is that either Intel is bullshitting about the new shuffle engine or that data alignment should be exactly the same proportion of processing in every engine? I can't follow your argument.

The obvious explanation is that Vas is either doing more shuffling than Anthony or Rybka is more sensitive to shuffle latency than Zappa.
Parent - - By Lukas Cimiotti (Bronze) [de] Date 2008-04-26 09:14
I remember very well what I said :)
Please read my last post again - I repeated that Zappa isn't faster on the new chips - but that doesn't match with your theories.
If my last post was unclear, I apologize.
Parent - - By Banned for Life (Gold) Date 2008-04-26 09:26
When you improve one component of a processor, you get a speedup only if you aren't already being throttled by a different bottleneck. You've stated that Rybka improves a lot more than Zappa at same clock and FSB when going from 65nm to 45nm. I don't find this too surprising. There are many possible explanations. My favorite theory at this point is that Vas is doing more streaming operations than Anthony.
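
The bottleneck point can be put in numbers with Amdahl's law. This is only a sketch: the fractions below are hypothetical and are not measurements of Rybka or Zappa.

```python
def overall_speedup(fraction: float, component_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when only `fraction` of the
    run time benefits from a component that is `component_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)


# Hypothetical engines: one spends half its time in the improved unit,
# the other only a tenth. The 1.134 factor is the upper end of the
# 0.5% to 13.4% range quoted earlier in the thread.
for name, fraction in (("engine A", 0.5), ("engine B", 0.1)):
    gain = (overall_speedup(fraction, 1.134) - 1.0) * 100
    print(f"{name}: {gain:.1f}% faster overall")
```

The same hardware change gives very different overall gains depending on how much of each engine's run time actually passes through the improved component.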

Note that one could also ask why a larger cache helps Rybka but not Zappa (I don't think it helps either one, but this is your theory).

Parent - - By Lukas Cimiotti (Bronze) [de] Date 2008-04-26 09:48
You are right :)

My favorite theory is: Rybka loves bigger caches.
We all know that Rybka takes good advantage of fast memory. Zappa doesn't care for memory speed.
Vas once told me, when we were discussing mp speedup, that hash fetches are not the problem. So maybe Rybka's inner loop doesn't fit into the processor cache. This could be an explanation, but I doubt that.
Bigger cache improves average memory speed. Maybe it's that simple; maybe some data from RAM is used twice?
Parent - - By Banned for Life (Gold) Date 2008-04-26 14:38
Bigger caches improve memory access times for data that can be expected to still be cache resident when it is needed multiple times. There are a number of scenarios where this breaks down. Streaming applications are one, because the data is only used once. Large hash tables (not necessarily for chess) are another because the hash mechanism is designed to access random locations in a much larger memory space.
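
A toy simulation of the three access patterns described above, using a small fully associative LRU cache. The sizes are arbitrary and scaled down for speed, and a real CPU cache is set-associative, so this is only meant to show the trend.

```python
import random
from collections import OrderedDict


def lru_hit_rate(addresses, capacity):
    """Fraction of accesses that hit in an LRU cache holding `capacity` lines."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for addr in addresses:
        total += 1
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            cache[addr] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used line
    return hits / total if total else 0.0


rng = random.Random(0)
CAPACITY = 1_000   # cache lines (arbitrary toy size)
N = 50_000         # number of accesses

small_working_set = [rng.randrange(500) for _ in range(N)]  # reused data that fits
streaming = list(range(N))                                  # every line used once
large_hash = [rng.randrange(200_000) for _ in range(N)]     # random probes, big table

for name, trace in (("small working set", small_working_set),
                    ("streaming", streaming),
                    ("large hash table", large_hash)):
    print(f"{name}: {lru_hit_rate(trace, CAPACITY):.1%} hits")
```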

It would be super fantastic if anyone could figure out a way to have hash table accesses be available in the L2 cache rather than having to go out to DRAM. But if you allow the hash accesses to be stored in L2, you end up with cache pollution, where important stuff in L2 cache (like instructions and other tables) gets overwritten by hash table entries. Since the hash table is designed to randomize the location of any hash entry in memory, your chance of finding the hash entry in the L2 cache would be approximately cache size / hash size. If you use the maximum possible hash size, this comes out to 12 MB / 2 GB, or roughly a 0.59% hit rate. This is too low to cause a meaningful improvement (and the loss of L2 access to instructions and non-hash tables would be a much bigger performance hit).
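
The cache size / hash size estimate as a quick calculation. This is a sketch assuming uniformly random probes and binary units; `expected_hit_rate` is an illustrative name.

```python
MB = 2 ** 20
GB = 2 ** 30


def expected_hit_rate(cache_bytes: int, table_bytes: int) -> float:
    """Chance that a uniformly random table entry is already in the cache,
    assuming the whole cache is filled with table entries (an upper bound)."""
    return min(1.0, cache_bytes / table_bytes)


for cache_mb in (8, 12):
    rate = expected_hit_rate(cache_mb * MB, 2 * GB)
    print(f"{cache_mb} MB L2, 2 GB hash: {rate:.2%} hit rate")
```

With binary units, 12 MB against 2 GB works out to about 0.59%. Either way the figure is well under 1%, and it is an upper bound, since in practice part of L2 holds instructions and other tables.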

Vas is well aware of all these issues and most likely does not allow hash entries to be stored in L2. It's possible that Rybka uses huge tables that need close to 12 MB of storage. Even then, I would be very surprised if the performance difference with an 8 MB cache was 6%. This would only be possible if the accesses to this data were cyclical, negating the cache's LRU replacement strategy.

Parent - - By Lukas Cimiotti (Bronze) [de] Date 2008-04-26 16:13
All this sounds really logical.
Some questions arise:
Are hash fetches really only used once (in case Vas allows them to be stored in L2)?
I made tests with very small hash tables (with only 1 core too), down to 1 MB - kn/s didn't go up very much. This could be because they aren't stored in L2, or - as Vas claims - because hash fetches aren't the problem.
Is it possible that Rybka uses huge tables that don't fit into L2? I don't think so - the size of the executable is only 2.69 MB for Rybka 2.3.2i22 (current freestyle version). OK, tables could be compressed in the executable file - but that is unlikely.

I really don't know why Rybka (and only Rybka) is so memory bound.

Btw. our little discussion really helped me.
To confirm my statements I clocked down my X5460 to 3 GHz and made some brief tests comparing it to an X5365. When I first saw the results I was shocked:
Rybka was ~22% faster on the X5460, Zappa ~16%. Was all that I told you bogus? That X5365 machine was formerly my surfing computer, running 2 5120 Xeons, and I hadn't optimized its BIOS settings for chess performance. I put in the 2 5365s when I disassembled another computer and they were left over. My tests reminded me to optimize the BIOS settings. Prefetches are worst for chess (OK, they improve streaming memory performance, but random accesses get slower), and the snoop filter of the 5000X is bad for chess too. After correcting these settings, performance was normal again: next to 0 performance increase for Zappa and an 8% gain for Rybka (these numbers are never super precise). I hope this will help Vas too, as he is currently using this computer (and some more) in Freestyle.

Parent - - By Vempele (Silver) [fi] Date 2008-04-26 16:19

> Are hash fetches really only used once (in case Vas allows them to be stored in L2)?

Twice (probing and storing).
Parent - By Banned for Life (Gold) Date 2008-04-26 16:47
OK, then the question becomes: How far apart do these accesses occur? If they are close together, the data will still be in the L1 data cache; if they are far apart, even the L2 cache will not be large enough. I think you may have both of these cases occurring in a typical chess engine, along with the possibility that the second access will be in the intermediate time frame where it would be in the L2 cache (if you allow it) but not in the L1 cache.

Parent - By Banned for Life (Gold) Date 2008-04-26 17:03
I suspect you are right and Vas's code fits reasonably well in either 8 or 12 MB of cache. As you've mentioned, this can be checked at some point by running on a clock/FSB equalized quad with only 6 MB of cache and seeing if the performance is the same as the 12 MB cache quad (of course this needs to be done with the same chipset and memory too).

I get the feeling that Vas has implemented a more efficient memory management system than other engines are using and this explains the increased importance of memory (and other) latencies. Get rid of one bottleneck and another one instantly appears...

Prefetching and cache snooping both cause additional FSB traffic, and cache snooping may require more work in the TLB as well. Prefetching is useless for hash accesses that are cache-line aligned, so turning it off is a no-brainer. Turning off cache snooping could result in occasionally using out-of-date data, but this has been shown to have a negligible effect on chess engine performance.
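
To illustrate the alignment point: a hash cluster that is no larger than a cache line and starts on a line boundary only ever touches one line, so a hardware prefetch of the neighbouring line brings in data the probe will not use. The sketch below assumes 64-byte cache lines (as on Core 2) and a hypothetical 64-byte cluster; the cluster size is not taken from any engine.

```python
CACHE_LINE = 64  # bytes per cache line


def lines_touched(offset: int, size: int, line: int = CACHE_LINE) -> int:
    """Number of cache lines covered by `size` bytes starting at byte `offset`."""
    first = offset // line
    last = (offset + size - 1) // line
    return last - first + 1


print(lines_touched(offset=128, size=64))  # aligned cluster
print(lines_touched(offset=176, size=64))  # misaligned cluster straddles a boundary
```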

I have also gotten something out of this discussion. When I am running a combination of 45nm and 65nm quads, I will know to run Rybka on the 45nm processors and Zappa on the 65nm processors to optimize results!

Parent - - By Wayne Lowrance (***) Date 2008-04-27 02:17
Hello again Lukas. It is an interesting discussion about the BIOS settings for, for example, Rybka and Zappa.
My question is this: as soon as I get my funding, I am going to purchase a Mac Pro 8x, 3 GHz, in part based on our conversations. As I understand it, the Mac Pro has no BIOS that can be optimized for chess. Is my understanding right? Should that affect my decision to buy the Mac Pro over the Jncs equivalent?
Thank you
Parent - - By Lukas Cimiotti (Bronze) [de] Date 2008-04-27 07:32
I have no new Mac Pro - my Mac Pro is first generation. That old Mac is ~5% slower for Rybka than an equally equipped computer I built myself. The new Mac Pro seems to be OK, but I haven't got exact numbers.

Parent - - By Wayne Lowrance (***) Date 2008-04-27 17:36
Okee dokie, thank you
Parent - - By Lukas Cimiotti (Bronze) [de] Date 2008-04-27 18:16
You're welcome :-)
Maybe you should ask Dummkoller - I know he's got a new Mac Pro 2.8 GHz and he had an "old" Mac Pro 8x 3 GHz. He told me the new one was faster for Rybka despite the lower clockspeed. But I don't know if he really made thorough tests.

Parent - By Wayne Lowrance (***) Date 2008-04-27 21:03
Ahhh, again I thank you. I will try and contact him.
Best regards
Parent - - By BB (****) [au] Date 2008-04-25 21:45

>bypassing the L2 cache when accessing the hash

There could be a slight ambiguity with "accessing" here. The use of PREFETCHNTA (NTA is non-temporal access) on some processors elides later eviction to the L2 cache. However, this instruction will typically look in L2 at the onset (and if it hits, will, in this case, evict back to L2). As indicated in a different thread, there is good reason to expect that later versions in the Rybka series are not insouciant toward the use of this.
Parent - By Banned for Life (Gold) Date 2008-04-25 22:15
PREFETCHNTA is one method of reducing cache pollution. Using streaming constructs is another. I think Vas is using the latter approach.

As far as PREFETCHNTA is concerned, it's true that if the hash cluster was found in L2, it would be returned there, but one would hope this wouldn't happen very often. Having hash values throw instructions and non-hash tables out of L2 would be highly undesirable.

Parent - By Lukas Cimiotti (Bronze) [de] Date 2008-04-24 19:26
My Asus P5K deluxe supports .5 multipliers, so this is only a BIOS issue.

Parent - - By ppipper (*****) [es] Date 2008-04-24 20:41
Hi all,

I am also thinking about buying a new PC, mainly for computer chess and analysis.

I prefer workstations rather than desktop configurations, due to the expansion options.

What do you think?

- motherboard: Tyan Tempest
- 1 Xeon 5460 3000 Mhz (wish list: another Xeon...)
- 8 Gb Ram
- 500 Gb HD
- Antec P-190

What are your opinions?

Thanks very much in advance,

Best regards!!
Parent - - By Lukas Cimiotti (Bronze) [de] Date 2008-04-24 22:22
Xeon X5460 is running at a speed of 3.16 GHz

I'd take 2x E5430 (2.66 GHz) instead of 1x X5460 if you haven't got enough money for 2x X5460

Tyan Tempest must have 5400 Seaburg chipset (there are also Tyan Tempest boards with 5000 chipset that won't work)

8 GB is OK - must be FB-DIMMs

Parent - - By ppipper (*****) [es] Date 2008-04-25 12:30

Thx for answering.

Do you think this is a good approach, or would it be better to go for the QX9650 series...?

Parent - - By Lukas Cimiotti (Bronze) [de] Date 2008-04-25 13:21
If you overclock a QX9650 to 4 GHz, you'll get roughly the same Rybka 2.3.2a performance as with 2x E5430.
My 2 QX9775 run on Intel Skulltrail board D5400XS.

Parent - By ppipper (*****) [es] Date 2008-04-27 22:42

Do you think a Skulltrail + 2 QX9775 is better than a Tyan Tempest 5400 + 2 Xeon 5460?

Thanks for your opinions!

Parent - By ppipper (*****) [es] Date 2008-04-25 13:16
Hi Kullberg,

One more thing: I am looking at your signature right now, and I have just one question: on which motherboard have you mounted your 2 Quads?

Thx a lot!
Parent - - By Sesse (****) [no] Date 2008-04-26 00:39
Is there a particularly good reason you recommend the Q9550 over Q9450? At least here, it's 70% more expensive, and the only change I can see is that it's 2.83GHz instead of 2.66GHz.

(Not that anyone here has even the Q9300 in stock yet...)

/* Steinar */
Parent - - By Phil Harris (***) [gb] Date 2008-04-26 02:27
The Q9450 has its multiplier locked at 8x, which means that unless your motherboard is capable of high FSB speeds, any overclocking is going to be limited.

The Q9550 can go to 8.5x, which doesn't sound like a big difference, but that extra 0.5 can make a noticeable impact.

Most P35, X38 or X48 motherboards will accept an FSB of around 440 before overclocking starts to become challenging. With a Q9450 this means 3.520 GHz, which can be achieved quite easily with a Q6600 G0. The Q9550 will be at 3.740 GHz, which would be difficult (but not impossible) to achieve with a Q6600.
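
Turning the same arithmetic around shows what the extra 0.5 buys: the FSB each multiplier would need to reach a given core clock. This is a sketch with a made-up helper name; the Q6600's 9x stock multiplier is not stated in the thread and is included here as an assumption for comparison.

```python
def required_fsb_mhz(target_mhz: float, multiplier: float) -> float:
    """FSB base clock needed to reach `target_mhz` at a given multiplier."""
    return target_mhz / multiplier


TARGET = 3740  # MHz, i.e. 8.5 x 440

for cpu, multiplier in (("Q9450", 8.0), ("Q9550", 8.5), ("Q6600 (assumed 9x)", 9.0)):
    print(f"{cpu}: needs FSB {required_fsb_mhz(TARGET, multiplier):.1f} MHz")
```

At 8x the board would have to run well past the 440 MHz comfort zone to match what 8.5x reaches at 440. For the Q6600 the FSB requirement is lower, so the limit there would presumably be heat and voltage rather than the board.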

So much depends on what you are hoping to achieve. For analysis of games, a Q6600 running at 3.2 GHz is probably enough for anyone. If you are looking to be competitive in either an engine room or freestyle chess, then any extra speed you can achieve can be a big help.