Stockfish is gaining almost nothing from 64 bits as such. Most of its gain comes from the 8 extra registers...
In C, there is not much you can do to optimize for additional registers, other than to write better code in the FIRST place. I have a 64 bit program that compiles and runs just fine on a 32 bit platform with zero changes by me to make the 32 bit program run any faster.
It's possible that there is significant register pressure on the general purpose registers. In the case of Rybka, I wouldn't be surprised if Vas used the XMM registers for temporary storage. If he made use of the additional 8 XMM registers available under x64, that could explain the >60 Elo loss from running a 32-bit build. It might be the same with Komodo.
Do you know what happens to MM0 when you execute that dreaded FLDZ instruction?
Do you know what happens to MM1-7? Apparently not. Do you know whether the compiler produces any OTHER FP operations in the code? Etc.
BTW, no one has reported any sign of MMX stuff in Rybka to date; apparently that is a fictional creation.
> Moron! We were discussing why Rybka 3 and 4 are more than twice as slow when using 32-bit code. You are a complete idiot! Just one stupid, ugly, white trash redneck.
Wow, did you run out of Xanax?
You DO realize Rybka has some floating point loads? :) I.e., FLDZ in two places that were discussed here... which blows the hell out of the MMX registers...
It's a pure nonsense explanation.
> Both Rybka 3 and Rybka 4 certainly use SSE and later instructions.
They don't use any special SSE tricks. Only popcount, prefetch and some memset/memcpy optimizations (the latter is standard C library stuff though). What is funny is that the 32-bit versions do not seem to use prefetch at all, even though the instruction has been available since the Pentium III.
Edit: Pentium II -> Pentium III
That's interesting. I had a forum discussion with Vas in March 2007 where he stated:
2) For some reason, for Rybka, direct streaming operations kill performance. For example, when you write a hash entry, you want to avoid polluting the cache. However, trying to use a streaming intrinsic for this absolutely kills Rybka performance (like 10% or so). The only streaming which works for Rybka is non-temporal prefetch.
I guess he either never went back to see what the issue was, or just never made it work. At the time, he had a pretty high threshold for stuff to work on. If he didn't think the improvement would net 3-5 Elo for a week of effort, he wouldn't go for it. He did read through some of Intel's optimization manuals while he was trying these things out. I think this was a priority for a while when he was initially dealing with some issues with access to non-local memory on multi socket boards.
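For anyone who wants to see what a non-temporal prefetch of a hash entry looks like, here is a minimal sketch in C. It is not Rybka's code: the `TTEntry` layout, the table size and the `tt_*` function names are all made up for illustration. It assumes GCC or Clang, where `__builtin_prefetch(addr, 0, 0)` requests a read prefetch with no temporal locality (on x86 this typically becomes `prefetchnta`); on other compilers the hint is simply skipped.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical transposition-table entry; the layout is an assumption. */
typedef struct {
    uint64_t key;    /* full hash key, used to verify the slot */
    int16_t  score;
    uint8_t  depth;
    uint8_t  flags;
} TTEntry;

#define TT_SIZE (1u << 16)          /* must be a power of two */
static TTEntry tt[TT_SIZE];

/* Map a hash key to its slot by masking the low bits. */
static inline TTEntry *tt_slot(uint64_t key) {
    return &tt[key & (TT_SIZE - 1)];
}

/* Hint that the slot will be read soon and has low temporal locality.
   Call this as soon as the key of the next position is known, so the
   memory access overlaps with other work (e.g. making the move). */
static inline void tt_prefetch(uint64_t key) {
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(tt_slot(key), 0, 0);
#else
    (void)key;                      /* no prefetch hint available */
#endif
}

/* Plain (cached) store; always overwrites the slot. */
static void tt_store(uint64_t key, int score, int depth) {
    TTEntry *e = tt_slot(key);
    e->key   = key;
    e->score = (int16_t)score;
    e->depth = (uint8_t)depth;
    e->flags = 0;
}

/* Returns the entry if the stored key matches, NULL otherwise.
   Note: in this sketch a key of 0 cannot be told apart from an empty slot. */
static const TTEntry *tt_probe(uint64_t key) {
    const TTEntry *e = tt_slot(key);
    return e->key == key ? e : NULL;
}
```

The prefetch is only a hint and never changes program results, so whether it helps has to be measured on the actual engine and hardware.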
> What is funny is that the 32-bit versions do not seem to use prefetch at all, even though the instruction has been available since the Pentium III.
Vas never wanted to spend any time on the 32bit builds, so it would be pretty strange if he went to the trouble of taking out the prefetches. I wonder how that happened?
Any idea why there is such a huge difference between 64 and 32 bit builds?
> Any idea why there is such a huge difference between 64 and 32 bit builds?
1. Bitboards are great for 64-bit, as we all know.
2. On new CPUs, using POPCNT adds some speed (it is available as an intrinsic). Of course, it only helps with bitboards.
3. I guess modern compilers make good use of additional registers.
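To make points 1 and 2 concrete, here is a small C sketch of a bitboard with a population count. The names (`Bitboard`, `popcount`, `popcount_swar`, `shift_north`, `RANK_2`) are illustrative, not taken from any engine mentioned here. It assumes GCC or Clang for `__builtin_popcountll`, which compiles to a single POPCNT instruction only when the target allows it (e.g. `-mpopcnt`); otherwise the compiler substitutes its own software routine. `popcount_swar` is a portable bit-twiddling fallback.

```c
#include <stdint.h>

typedef uint64_t Bitboard;          /* one bit per square, a1 = bit 0 */

#define RANK_2 0x000000000000FF00ULL

/* Portable population count (SWAR); works on any 32- or 64-bit target. */
static inline int popcount_swar(Bitboard b) {
    b = b - ((b >> 1) & 0x5555555555555555ULL);            /* 2-bit sums */
    b = (b & 0x3333333333333333ULL)
      + ((b >> 2) & 0x3333333333333333ULL);                /* 4-bit sums */
    b = (b + (b >> 4)) & 0x0F0F0F0F0F0F0F0FULL;            /* 8-bit sums */
    return (int)((b * 0x0101010101010101ULL) >> 56);       /* add bytes  */
}

/* Uses the compiler builtin when available, else the SWAR fallback. */
static inline int popcount(Bitboard b) {
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_popcountll(b);
#else
    return popcount_swar(b);
#endif
}

/* Move every piece on the board one rank up: one shift on a 64-bit
   target, but two 32-bit shifts plus carry handling on a 32-bit one. */
static inline Bitboard shift_north(Bitboard b) {
    return b << 8;
}
```

On a 32-bit build the same source still compiles, but every `Bitboard` operation is split across two 32-bit registers, which also adds to the register pressure mentioned in point 3.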
> Any idea why there is such a huge difference between 64 and 32 bit builds?
Yes. The 32-bit builds of later Rybkas are heavily flawed. With the flaws corrected, I would guess they would be only 30-40 points below their 64-bit counterparts.
I think he was using the free MSVC 2003 version in the Rybka 1.0 Beta and Rybka 1 time frame. They had an SDK which provided x64 capabilities, which he would have had to use to do the 64-bit builds.