Rybka Chess Community Forum
Up Topic Rybka Support & Discussion / Rybka Discussion / Stockfish vs. Rybka gain 32 to 64 bits
- - By bob (Gold) Date 2013-09-16 13:39
Tell me EXACTLY what is improved in a program by moving from 32 bit hardware to 64 bits?  Does that somehow make the program qualitatively better?  I'd be interested in hearing ANY explanation to justify that.  The ONLY gain has to be speed.  A program does not get "smarter" just by moving to 64 bit words, or by adding 8 more registers.  That ONLY affects speed.  Hence NPS is the only thing you can use to measure Elo improvement.

Stockfish is gaining almost zero from 64 bits.  It is gaining most of that from 8 extra registers...

In C, there is not much you can do to optimize for additional registers, other than to write better code in the FIRST place.  I have a 64 bit program that compiles and runs just fine on a 32 bit platform with zero changes by me to make the 32 bit program run any faster.
Parent - - By Banned for Life (Gold) Date 2013-09-16 15:08
> Stockfish is gaining almost zero from 64 bits.  It is gaining most of that from 8 extra registers...

It's possible that there is significant register pressure on the general purpose registers. In the case of Rybka, I wouldn't be surprised if Vas used the XMM registers for temporary storage. If he made use of the additional 8 XMM registers available under x64, that could explain the >60 Elo loss from running a 32-bit build. Might be the same with Komodo.
Parent - - By bob (Gold) Date 2013-09-16 15:12
How did Vas access XMM registers from C/C++?
Parent - - By Banned for Life (Gold) Date 2013-09-16 16:07
I never asked, but I would assume using intrinsics for insertps, pinsr, extractps and pextr.
Parent - - By bob (Gold) Date 2013-09-16 18:59
And, of course, he used those damned floating point 0.0's which would be a MINOR problem.  :)
Parent - - By Banned for Life (Gold) Date 2013-09-16 19:32
They would be a very minor problem, in fact no problem at all...
Parent - - By bob (Gold) Date 2013-09-16 22:50
Since you don't know X86, I suppose it would not be a problem.

Do you know what happens to MM0 when you do that dreaded FLDZ instruction?

Do you know what happens to MM1-MM7?  Apparently not.  Do you know whether the compiler produces any OTHER fp operations in the code?  Etc.

BTW no one has reported any sign of MMX stuff in Rybka to date, apparently that is a fictional creation.
Parent - - By Banned for Life (Gold) Date 2013-09-16 23:19
Moron!  We were discussing why Rybka 3 and 4 are more than twice as slow when using 32-bit code. You are a complete idiot! Just one stupid, ugly, white trash redneck.
Parent - - By Scott (*****) Date 2013-09-16 23:21

> Moron!  We were discussing why Rybka 3 and 4 are more than twice as slow when using 32-bit code. You are a complete idiot! Just one stupid, ugly, white trash redneck.


Wow, did you run out of Xanax?
Parent - By Banned for Life (Gold) Date 2013-09-16 23:28
:lol:
Parent - - By bob (Gold) Date 2013-09-17 05:45
Guess that means you've run out of arguments.  :)

typical...

You DO realize Rybka has some floating point loads?  :)  I.e., FLDZ in two places that were discussed here...  blows the hell out of the MMX registers...

It's a pure nonsense explanation.
Parent - - By Banned for Life (Gold) Date 2013-09-17 06:12
We were never discussing mmx registers in ANY context, you moron. LEARN TO READ.
Parent - - By bob (Gold) Date 2013-09-17 17:17
I'm always thinking "Rybka 1.0 beta" since that is THE topic being discussed.  Going back that far,  MMX seemed to be more likely.  But regardless, no SSE instructions have been found in 1.0 beta to date, which pretty much renders this "moot"...
Parent - - By Banned for Life (Gold) Date 2013-09-17 18:04
You're completely senile. The discussion was about why Rybka 3 and Rybka 4 lost so much Elo in a 32-bit build. Both Rybka 3 and Rybka 4 certainly use SSE and later instructions. You need to stop drinking in the morning.
Parent - - By Richard Vida (**) Date 2013-09-17 18:23 Edited 2013-09-17 18:26

> Both Rybka 3 and Rybka 4 certainly use SSE and later instructions.


They don't use any special SSE tricks. Only popcount, prefetch and some memset/memcpy optimizations (the latter is standard C library stuff though). What is funny is that the 32-bit versions do not seem to use prefetch at all, even though the instruction has been available since the Pentium III.

Edit: Pentium II -> Pentium III
Parent - - By Banned for Life (Gold) Date 2013-09-18 03:45
> They don't use any special SSE tricks. Only popcount, prefetch and some memset/memcpy optimizations (the latter is standard C library stuff though).

That's interesting. I had a forum discussion with Vas in March 2007 where he stated:

> 2) For some reason, for Rybka, direct streaming operations kill performance. For example, when you write a hash entry, you want to avoid polluting the cache. However, trying to use a streaming intrinsic for this absolutely kills Rybka performance (like 10% or so). The only streaming which works for Rybka is non-temporal prefetch.

I guess he either never went back to see what the issue was, or just never made it work. At the time, he had a pretty high threshold for stuff to work on. If he didn't think the improvement would net 3-5 Elo for a week of effort, he wouldn't go for it. He did read through some of Intel's optimization manuals while he was trying these things out. I think this was a priority for a while when he was initially dealing with some issues with access to non-local memory on multi socket boards.

> What is funny is that the 32-bit versions do not seem to use prefetch at all, even though the instruction has been available since the Pentium III.

Vas never wanted to spend any time on the 32-bit builds, so it would be pretty strange if he went to the trouble of taking out the prefetches. I wonder how that happened?
Parent - By bob (Gold) Date 2013-09-18 13:40 Edited 2013-09-18 15:56
Hmm...  So DOES it use SSE, as you directly claimed, or does it not?  You continue to make pronouncements that are proven to be false.  What do you make of that?
Parent - By Dragon Mist (****) Date 2013-09-17 15:42
:evil:
Parent - - By Lukas Cimiotti (Bronze) Date 2013-09-17 06:00
I talked to Vas several years ago about optimizing speed by using inline asm. He didn't like the idea. In those days he didn't even care about the compiler - he used something very old. Later - around the time of Rybka 3 - he tested some new compilers. But he never did any low-level optimizing.
Parent - - By Banned for Life (Gold) Date 2013-09-17 06:04
Thanks. I had a discussion with him on this forum a number of years back about using streaming extensions. I think he at least tried something, but could be mistaken.

Any idea why there is such a huge difference between 64 and 32 bit builds?
Parent - By Lukas Cimiotti (Bronze) Date 2013-09-17 06:29

>Any idea why there is such a huge difference between 64 and 32 bit builds?


1. Bitboards are great for 64 bit - as we all know.
2. For new CPUs using POPCNT adds some speed (that's an intrinsic). Of course it only works for bitboards.
3. I guess modern compilers make good use of additional registers.
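
To make points 1 and 2 concrete, here is a minimal sketch in C (not taken from Rybka or any other engine; the function names are made up for illustration). It counts the set bits of a bitboard once with native 64-bit arithmetic and once as two 32-bit halves, which is roughly the extra work a 32-bit build has to do for every bitboard operation. On CPUs with POPCNT, a compiler intrinsic would replace the whole 64-bit routine with a single instruction.

```c
#include <stdint.h>

/* Hypothetical helper: population count of a 64-bit bitboard using
 * plain 64-bit arithmetic (the classic SWAR bit-counting trick). */
int popcount64(uint64_t b) {
    b = b - ((b >> 1) & 0x5555555555555555ULL);
    b = (b & 0x3333333333333333ULL) + ((b >> 2) & 0x3333333333333333ULL);
    b = (b + (b >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (int)((b * 0x0101010101010101ULL) >> 56);
}

/* The same trick on a 32-bit word. */
int popcount32(uint32_t v) {
    v = v - ((v >> 1) & 0x55555555u);
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;
    return (int)((v * 0x01010101u) >> 24);
}

/* Roughly what a 32-bit target ends up doing: the bitboard is split
 * into a low and a high half and each half is processed separately. */
int popcount64_as_two_halves(uint64_t b) {
    return popcount32((uint32_t)b) + popcount32((uint32_t)(b >> 32));
}
```

Both routines return the same count; the difference is only in how many instructions and registers the 32-bit path needs, which fits the view that the gain shows up as speed rather than as a change in what the program computes.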
Parent - By Richard Vida (**) Date 2013-09-18 23:01

> Any idea why there is such a huge difference between 64 and 32 bit builds?


Yes. The 32-bit builds of later Rybkas are heavily flawed. With the flaws corrected I would guess they would be only 30-40 points below their 64-bit counterparts.
Parent - By Banned for Life (Gold) Date 2013-09-17 06:17
From the discussion, it looks like he used an intrinsic to prefetch TT entries. I would be surprised if he wasn't doing this...
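
For anyone wondering what prefetching a TT entry looks like in practice, here is a rough sketch in C. It is an assumption about the general technique, not Rybka's code: the `TTEntry` layout, the `TTable` struct and all function names are hypothetical. It uses the GCC/Clang builtin `__builtin_prefetch` with locality 0 (a non-temporal hint); with MSVC the usual equivalent would be `_mm_prefetch` with `_MM_HINT_NTA`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical transposition table entry. */
typedef struct {
    uint64_t key;    /* full hash key, used to verify a hit */
    int16_t  score;
    uint8_t  depth;
    uint8_t  flags;
} TTEntry;

/* Hypothetical table: the entry count is assumed to be a power of two,
 * and mask is that count minus one. */
typedef struct {
    TTEntry *entries;
    size_t   mask;
} TTable;

size_t tt_index(const TTable *tt, uint64_t key) {
    return (size_t)(key & tt->mask);
}

/* Issue the prefetch as soon as the hash key of the next position is
 * known, so the cache line is on its way while other work is done.
 * Arguments: address, 0 = read access, 0 = low temporal locality. */
void tt_prefetch(const TTable *tt, uint64_t key) {
    __builtin_prefetch(&tt->entries[tt_index(tt, key)], 0, 0);
}

/* Ordinary (non-streaming) store into the slot selected by the key. */
void tt_store(TTable *tt, uint64_t key, int16_t score, uint8_t depth) {
    TTEntry *e = &tt->entries[tt_index(tt, key)];
    e->key = key;
    e->score = score;
    e->depth = depth;
    e->flags = 0;
}

/* Returns the entry on a key match, or NULL on a miss. */
const TTEntry *tt_probe(const TTable *tt, uint64_t key) {
    const TTEntry *e = &tt->entries[tt_index(tt, key)];
    return e->key == key ? e : NULL;
}
```

The prefetch is only a hint and never changes the result of `tt_probe`; whether it helps depends on how much work sits between the prefetch and the probe.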
Parent - By Banned for Life (Gold) Date 2013-09-17 06:21
> In those days he didn't even care about the compiler - he used something very old.

I think he was using the free MSVC 2003 version in the Rybka 1.0 Beta and Rybka 1 time frame. They had an SDK which provided x64 capabilities which he would have had to use to do the 64-bit builds.