Parent - - By Banned for Life (Gold) Date 2010-06-05 03:30
Sorry about all the boards at the starting position. I forgot that this happened automatically when a FEN was posted...
Parent - By Mark (****) [us] Date 2010-06-05 10:27
At least I'm on my desktop computer now.  Sometimes it's a pain to scroll through a lot of boards on my phone!  :)
Parent - - By Akbarfan (***) Date 2010-06-05 14:35
You could simply edit it - within 1 hour. :-)
Parent - By Banned for Life (Gold) Date 2010-06-06 06:14
I didn't think of that...
Parent - - By Vasik Rajlich (Silver) [pl] Date 2010-06-05 11:59

> The high variance around the trend line is due to the small set size, which allows important results to be lost even when the hash is not very full.


I am pretty sure that the variance is due to a combination of search luck and the gradual, systematic overwriting of low-depth entries throughout the search. Higher associativity wouldn't change anything.

> Given the variance in depth as a function of time due to hash size alone, it is reasonable to assume that a better hash policy could net at least a half ply of depth, since it's clear that results are being lost even during short time periods with a large hash.


If you double the hash size 9 times like in your example, then yes, you can expect a significant performance hit. An ideal replacement policy compared to a decent replacement policy might be the equivalent of multiplying the hash by a factor of 1.5 (let's say).

> Changing gears, I guess if you're 4-way set associative, you're back with 4 16-byte hash entries. This would mean that Rybka will continue to run a little faster on i7 systems with one memory channel as opposed to two or three (as Lukas confirmed for the R3 case).


8 bytes per entry. I'm not sure why fewer memory channels should ever help (although in practice this too doesn't really matter).
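
(To make the 8-bytes-per-entry arithmetic concrete, a 64-bit entry might be packed something like the sketch below. This is only a plausible layout for illustration, not Rybka's actual format.)

    #include <cstdint>

    // One transposition-table entry packed into 64 bits -- a plausible layout
    // for illustration only, not Rybka's actual format.
    struct HashEntry {
        std::uint64_t key16 : 16;  // upper bits of the Zobrist key, for verification
        std::uint64_t move  : 16;  // best move found at this node
        std::uint64_t score : 16;  // search score, stored with an offset so it fits unsigned
        std::uint64_t depth : 8;   // remaining search depth
        std::uint64_t bound : 2;   // exact / lower bound / upper bound
        std::uint64_t age   : 6;   // search generation, used by the replacement policy
    };
    static_assert(sizeof(HashEntry) == 8, "four of these fit in a 32-byte set");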

Vas
Parent - - By Banned for Life (Gold) Date 2010-06-06 06:13 Edited 2010-06-06 06:18
I am pretty sure that the variance is due to a combination of search luck and the gradual, systematic overwriting of low-depth entries throughout the search. Higher associativity wouldn't change anything.

I used a single process and the search was 100% consistent when the same hash size was used, but was different almost immediately when I went from 1 GB hash to 2 GB hash. Unless there is something in your code that is a function of the hash size, this has to be due to differences in retrieving positions from hash when the hash is nearly empty. Assuming there isn't a bug, this would have to be from results available in one case and not in the other. Of course, once this happens even once, the chaotic nature evident here will cause cascading differences in move order and evaluations.

An ideal replacement policy compared to a decent replacement policy might be the equivalent of multiplying the hash by a factor of 1.5 (let's say).

Computer designers know better than this, and they pay a smaller penalty for being wrong (the ratio of the time it takes to get a cache line from memory to the time it takes to get it from a cache). Cache management for computers has come a long way since LRU was in vogue twenty years ago, and is an important contributor to recent performance gains.

An ideal hash replacement policy and its performance would be difficult to define. A better comparison would be to a lossless hash (i.e. every position you put in there can be retrieved at a later point). I suspect you are taking a pretty significant hit relative to this perfect associative memory, a lot more than a few Elo.
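
(For reference, the kind of "decent" bucket-level policy being discussed might look roughly like the sketch below: a generic depth-plus-age heuristic, self-contained and purely illustrative, not a description of Rybka's code.)

    #include <cstdint>

    // Minimal entry with just the fields the policy needs; see the 8-byte
    // layout sketched earlier in the thread.
    struct HashEntry {
        std::uint16_t key16;   // key verification bits
        std::uint8_t  depth;   // remaining search depth
        std::uint8_t  age;     // search generation
        // move, score and bound omitted here
    };

    constexpr int kSetSize = 4;
    struct Bucket { HashEntry slot[kSetSize]; };

    // A generic depth-plus-age heuristic: overwrite the same position if it is
    // already present, otherwise evict the entry that is shallow and stale.
    // This is the "decent but not ideal" kind of policy under discussion, not
    // a description of Rybka's actual code.
    void store(Bucket& b, const HashEntry& fresh, std::uint8_t current_age) {
        int victim = 0;
        int worst  = 1 << 30;
        for (int i = 0; i < kSetSize; ++i) {
            HashEntry& e = b.slot[i];
            if (e.key16 == fresh.key16) { e = fresh; return; }   // same position: overwrite
            // Current-generation entries are worth more than stale ones; within
            // a generation, deeper entries are worth more than shallow ones.
            int value = e.depth + (e.age == current_age ? 256 : 0);
            if (value < worst) { worst = value; victim = i; }
        }
        b.slot[victim] = fresh;   // replace the least valuable entry
    }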

I'm not sure why fewer memory channels should ever help (although in practice this too doesn't really matter).

If a single DDR3 memory channel brings in data 64 bytes at a time (and your hash sets are less than or equal to 64 bytes), then the number of bytes brought in is 64 x the number of memory channels. The memory controller brings in the next cache line (dual channel) or the next two cache lines (triple channel) on the assumption of locality. Obviously these extra cache lines won't be used in a hashed table. So the single-channel system has a little less latency and runs a little faster. If you are using 4 x 8 = 32-byte hash lines, someone with triple-channel memory is wasting 5/6ths of his bandwidth. That doesn't really matter much as long as memory utilization is low.

In practice memory bandwidth has many of the characteristics of a highway. You can add more cars with no ill effect, and then at some point, things degrade rapidly into gridlock. For Rybka, memory bandwidth requirements increase proportionately to number of processes times core speed. I'm not sure how far you are from the edge of the cliff at this point. If an overclocked dual six-core unit isn't close, it's nothing to worry about. But someone on a budget can get by with one channel and get a slight improvement, and should have an easier time overclocking.
Parent - - By Vasik Rajlich (Silver) [pl] Date 2010-06-06 12:31

> I used a single process and the search was 100% consistent when the same hash size was used, but was different almost immediately when I went from 1 GB hash to 2 GB hash. Unless there is something in your code that is a function of the hash size, this has to be due to differences in retrieving positions from hash when the hash is nearly empty. Assuming there isn't a bug, this would have to be from results available in one case and not in the other. Of course, once this happens even once, the chaotic nature evident here will cause cascading differences in move order and evaluations.


That's search luck. One tiny difference upstream can cause massive differences later on - it's a butterfly effect.

> Computer designers know better than this, and they pay a smaller penalty for being wrong (the ratio of the time it takes to get a cache line from memory to the time it takes to get it from a cache). Cache management for computers has come a long way since LRU was in vogue twenty years ago, and is an important contributor to recent performance gains.


This is mainly because the caches are now a lot bigger, which is due to CPU designers struggling to take full advantage of Moore's Law.

I'm aware that CPU caches typically have an associativity higher than 4, but this is partly because they are implemented in hardware. For Rybka, doubling the associativity would roughly double the read and write times. For a hardware cache, I guess access time is either unrelated to associativity or some sort of log2()-type function of associativity.

> An ideal hash replacement policy and its performance would be difficult to define. A better comparison would be to a lossless hash (i.e. every position you put in there can be retrieved at a later point). I suspect you are taking a pretty significant hit relative to this perfect associative memory, a lot more than a few Elo.


Sure. All I'm saying is that you won't get this kind of performance gain with tweaks to a good-but-not-great hashing algorithm.

> The memory controller brings in the next cache line (dual channel) or the next two cache lines (triple channel) on the assumption of locality. Obviously these extra cache lines won't be used in a hashed table. So the single channel system has a little less latency and runs a little faster.


Aha. I wonder if pre-fetching would disable that. That seems like a very double-edged tradeoff for many types of applications, not just a chess engine.

> In practice memory bandwidth has many of the characteristics of a highway. You can add more cars with no ill effect, and then at some point, things degrade rapidly into gridlock. For Rybka, memory bandwidth requirements increase proportionately to number of processes times core speed. I'm not sure how far you are from the edge of the cliff at this point.


It can't be very close. There are a ton of operations per memory fetch.

Vas
Parent - - By Banned for Life (Gold) Date 2010-06-06 18:21
That's search luck. One tiny difference upstream can cause massive differences later on - it's a butterfly effect.

You've stated in the past that the chaotic nature of search is a good thing. I'm not sure whether that's true or not, but unnecessary chaos in the hash isn't good. It's clear that changing from a 1 GB to a 2 GB hash causes differences in what gets found within only a few seconds on a slow single-core processor. From what you've written above, this seems likely to be due to the one entry for the most recent position being different.

This is mainly because the caches are now a lot bigger, which is due to CPU designers struggling to take full advantage of Moore's Law.

Caches have more capacity now due to smaller transistors, but cache controllers have grown much faster than the cache memory itself and are now smart enough to sometimes intelligently predict which cache lines to get rid of when new cache lines must be brought in (more intelligently than just getting rid of the oldest ones first like they did when I was working on them). Obviously, no amount of smarts will help with a hashed table.

For Rybka, doubling the associativity would roughly double the read and write times.

Maybe I'm misunderstanding something, but if your set size is 4 and your bytes per hash entry is 8, it seems you are using 32 bytes per hash record. Intel processors have been using 64 byte cache lines for a while now, so it's not obvious why you would use less. DDR3 memories put out 64 bytes per access whether you want them or not. You don't even gain in latency for using less than this. Furthermore, for a system with dual channel DDR3, you should see no appreciable increase in read and write times when using two contiguous cache lines, and with triple channel, you should see no appreciable increase in read and write times when using three contiguous cache lines.
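
(Concretely, using the full line would mean something like the following, assuming 8-byte entries; again just an illustration of the layout being argued for, nothing Rybka-specific.)

    #include <cstddef>
    #include <cstdint>

    // With 8-byte entries, eight of them fill one 64-byte cache line exactly,
    // so a probe touches a single line no matter which slot hits.  A sketch of
    // the layout being argued for here, not of what Rybka actually does.
    struct alignas(64) Bucket {
        std::uint64_t entry[8];
    };
    static_assert(sizeof(Bucket) == 64, "one bucket == one cache line");

    // Index the table by the low bits of the Zobrist key; the high bits go
    // inside the entry for verification.  num_buckets must be a power of two.
    inline Bucket& bucket_for(Bucket* table, std::size_t num_buckets, std::uint64_t zobrist) {
        return table[zobrist & (num_buckets - 1)];
    }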

Rather than the memory read/write times, which shouldn't change if you are sticking within a cache line and not writing back prematurely, you seem to be very concerned with the amount of processing you do within a hash set. This processing will probably increase linearly with the set size, but the importance of the delay is highly dependent on the depths of the entries in the set. Burning a few hundred more processing cycles to increase the probability of saving a shallow depth search may not be worth it. Saving a high depth search is another matter. So there is good reason to believe that a larger set size will be more productive for slower time controls, and maybe counterproductive for fast time controls.

The strange thing here is that computer designers take the advantages of larger set size as a given and trade them off against latency using well established Markov models. For whatever reason, many chess engine developers don't seem to accept this.

I wonder if pre-fetching would disable that. That seems like a very double-edged tradeoff for many types of applications, not just a chess engine.

Prefetching can save you from seeing the added latency, but it can't reduce the number of bytes brought in during a DDR3 memory access. By the way, you're not alone. There are still a lot of applications that don't benefit from dual channel DDR3, and very few that benefit from triple channel DDR3 (most actually lose performance with triple channel).

There are a ton of operations per memory fetch.

I did a comparison many years ago to estimate the benefits of a larger cache for a customer that was doing a shitload of processing for each memory access (his description). In that case, it made sense to get the extra cache. But I'm not sure if a shitload is bigger or smaller than a ton! :-D
Parent - - By Vasik Rajlich (Silver) [pl] Date 2010-06-07 20:40

> It's clear that changing from a 1 GB to a 2 GB hash causes differences in what gets found within only a few seconds on a slow single-core processor. From what you've written above, this seems likely to be due to the one entry for the most recent position being different.


Probably, but it could also be due to a bunch of higher-depth entries congregating at one place. The point is that one tiny difference is all you need for everything to go in a totally different direction.

> but cache controllers have grown much faster than the cache memory itself and are now smart enough to sometimes intelligently predict which cache lines to get rid of when new cache lines must be brought in (more intelligently than just getting rid of the oldest ones first like they did when I was working on them).


I don't know about the "much faster" part, but it's true that cache controllers are much smarter now. Unfortunately there is no real analogy between that and a chess engine hash table. In a chess engine hash table, very simple heuristics about what to keep and what to discard work pretty well, so it's hard to improve on them. It's just obvious that depth and recency are important.

> Maybe I'm misunderstanding something, but if your set size is 4 and your bytes per hash entry is 8, it seems you are using 32 bytes per hash record. Intel processors have been using 64 byte cache lines for a while now, so it's not obvious why you would use less.


If I used a set size of 8, I would need to do 8 compare operations for each probe or write.
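
(In other words, the probe is a linear scan over the set, roughly as in the generic sketch below, so the compare count tracks the associativity directly.)

    #include <cstdint>

    // The probe is a linear scan over the set, so the number of key compares
    // per probe (or per write) grows with the associativity.  A generic sketch
    // of the 4-versus-8 tradeoff discussed here.
    template <int SetSize>
    int probe_slot(const std::uint16_t (&keys)[SetSize], std::uint16_t key16) {
        for (int i = 0; i < SetSize; ++i)   // SetSize compares in the worst case
            if (keys[i] == key16)
                return i;                   // hit: caller reads the rest of the entry
        return -1;                          // miss
    }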

> Furthermore, for a system with dual channel DDR3, you should see no appreciable increase in read and write times when using two contiguous cache lines, and with triple channel, you should see no appreciable increase in read and write times when using three contiguous cache lines.


I guess that's true for streaming operations but I see no way to take advantage of it.

> Burning a few hundred more processing cycles to increase the probability of saving a shallow depth search may not be worth it. Saving a high depth search is another matter. So there is good reason to believe that a larger set size will be more productive for slower time controls, and maybe counterproductive for fast time controls.


I guess that makes sense. A larger set size should also be more productive for smaller hash sizes.

From what I remember, though, 8-way associativity just didn't help at all under any conditions. I think the point is that important hash entries always manage to get stored and less important hash entries just don't matter much.

> The strange thing here is that computer designers take the advantages of larger set size as a given and trade them off against latency using well established Markov models. For whatever reason, many chess engine developers don't seem to accept this.


CPU design teams will have a few cache experts who have nothing else to do, so of course they analyze this problem down to the bone.

Engines are written by one guy, who has way too much to do and not nearly enough time to do it. So, little things like hash replacement policies just don't get much attention.

> There are still a lot of applications that don't benefit from dual channel DDR3, and very few that benefit from triple channel DDR3 (most actually lose performance with triple channel).


Interesting. I guess there was no easy way to make it disableable and it was deemed beneficial on average.

> I did a comparison many years ago to estimate the benefits of a larger cache for a customer that was doing a shitload of processing for each memory access (his description). In that case, it made sense to get the extra cache. But I'm not sure if a shitload is bigger or smaller than a ton! 


Shitload is a word you use when you're trying to exaggerate, so it doesn't surprise me. :)

Vas
Parent - - By Banned for Life (Gold) Date 2010-06-08 15:28
If I used a set size of 8, I would need to do 8 compare operations for each probe or write.

Probes and writes are very different and generally don't belong in the same sentence. The memory read operation associated with a probe takes orders of magnitude more time than a few compare operations with data in L1 cache. Unless you have decoupled the identification of positions to probe from the evaluation of those positions, the compares will be a negligible portion of the delay. Assuming you aren't doing this, even if you prefetch as soon as you have the hash index, it's likely you're still spending a lot of cycles waiting for the data.
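
(The decoupling described here is roughly the pattern below: issue the prefetch the moment the index is known, do unrelated work, and only then touch the bucket. The helper names and the 64-byte bucket size are assumptions for the sketch, not Rybka's.)

    #include <cstddef>
    #include <cstdint>
    #include <xmmintrin.h>   // _mm_prefetch

    // Start the memory access the moment the hash index is known, do unrelated
    // work, and only then read the bucket.  A generic sketch; the helper names
    // are placeholders.
    void search_node(std::uint64_t zobrist, const char* table, std::size_t num_buckets) {
        const char* bucket = table + (zobrist & (num_buckets - 1)) * 64;  // 64-byte buckets
        _mm_prefetch(bucket, _MM_HINT_T0);   // non-blocking: the line starts moving toward L1

        // ... hundreds of cycles of other work here (move generation, evaluation setup, ...)

        // By the time the bucket is actually probed, the line is hopefully
        // already in cache and the probe doesn't stall:
        // probe(bucket, zobrist >> 48);
    }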

Hash writes are non-blocking and therefore won't incur a delay, so the compare operations may be very significant. But even a small increase in hash hit rate would more than offset the time required for the compare operations in this case, so you appear to be making the claim that the increase in hit rate would be very close to zero. This is contradicted by the increase in strength from using a larger hash, which shares many of the characteristics of using a larger set size assuming a reasonable replacement algorithm is being utilized.

I guess that's true for streaming operations but I see no way to take advantage of it.

Streaming or not, the memory system on an i7 with three memory channels can't bring in less than 192 bytes.

CPU design teams will have a few cache experts who have nothing else to do, so of course they analyze this problem down to the bone. Engines are written by one guy, who has way too much to do and not nearly enough time to do it. So, little things like hash replacement policies just don't get much attention.

Agreed. The CPU is big business with a capital b, and the alternative to architecture improvements is new fabs at $5 billion apiece. This makes it very cost effective to have a whole team of engineers looking at each aspect of the problem, hoping to gain a few percent speed increase here and there. Chess engine development is on the other end of the scale with only a few people working on them full time, so only the biggest issues are addressed, and anything else becomes unimportant.

The problem with this, vis-à-vis Rybka, is that Rybka is pretty painful to do interactive analysis with, because when you go down a secondary line you almost instantly lose what you had stored on the main line. Until R4 came out, I assumed you were going to address this with a more advanced persistent hash. Now that the persistent hash is gone, the alternatives are the frustration of analyzing with Rybka versus the more pleasant experience of analyzing with another, weaker engine.
Parent - By Vasik Rajlich (Silver) [pl] Date 2010-06-08 21:31

> The memory read operation associated with a probe takes orders of magnitude more time than a few compare operations with data in L1 cache. Unless you have decoupled the identification of positions to probe from the evaluation of those positions, the compares will be a negligible portion of the delay. Assuming you aren't doing this, even if you prefetch as soon as you have the hash index, it's likely you're still spending a lot of cycles waiting for the data.


I haven't checked this with VTune, but I'm pretty sure that there is no delay. There are a ton (:)) of operations between the time the prefetch is started and the time the prefetched data is used.

> But even a small increase in hash hit rate would more than offset the time required for the compare operations in this case


Yes, probably. Intelligence at the cost of speed is usually a good tradeoff, because it translates better to longer time controls, better prepares for tomorrow's hardware, and is friendlier to further changes.

> But even a small increase in hash hit rate would more than offset the time required for the compare operations in this case, so you appear to be making the claim that the increase in hit rate would be very close to zero. This is contradicted by the increase in strength from using a larger hash, which shares many of the characteristics of using a larger set size assuming a reasonable replacement algorithm is being utilized.


Is there some general theoretical model for this? I.e., increasing the set size from 4 to 8 for some general case should be equivalent to increasing the hash size by how much?

As far as I remember, there was no benefit at all going from 4 to 8, with the caveat that in my case going from 4 to 8 decreases the MRU part of my hash from 1/4th to 1/8th of the total hash. Maybe I'll try this again, just out of curiosity.
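
(One crude way to attack the question empirically, if anyone wants to try it, is a small simulation along these lines; the replacement rule and all the parameters below are made up purely for illustration.)

    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <random>
    #include <vector>

    // A crude empirical stand-in for the missing theoretical model: store random
    // entries under a simple "always replace the shallowest slot" rule and see
    // what fraction of deep entries survives for different set sizes and table
    // sizes.  All parameters here are invented for illustration.
    struct Entry { std::uint64_t key; int depth; };

    double deep_survival(std::size_t buckets, int set_size, std::size_t stores,
                         std::mt19937_64& rng) {
        std::vector<Entry> table(buckets * set_size, Entry{0, -1});
        std::geometric_distribution<int> depth_dist(0.3);   // mostly shallow, occasionally deep
        std::size_t deep_stored = 0, deep_alive = 0;
        for (std::size_t s = 0; s < stores; ++s) {
            std::uint64_t key = rng();
            int depth = depth_dist(rng);
            Entry* b = &table[(key % buckets) * set_size];
            int victim = 0;
            for (int i = 1; i < set_size; ++i)              // pick the shallowest slot
                if (b[i].depth < b[victim].depth) victim = i;
            if (depth >= 8) ++deep_stored;
            b[victim] = Entry{key, depth};
        }
        for (const Entry& e : table)
            if (e.depth >= 8) ++deep_alive;
        return deep_stored ? double(deep_alive) / double(deep_stored) : 0.0;
    }

    int main() {
        std::mt19937_64 rng(42);
        for (int set_size : {4, 8})
            for (std::size_t buckets : {std::size_t(1) << 16, std::size_t(1) << 17})
                std::cout << "set size " << set_size << ", buckets " << buckets
                          << ": deep survival "
                          << deep_survival(buckets, set_size, 1u << 22, rng) << "\n";
    }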

> The problem with this, vis-à-vis Rybka, is that Rybka is pretty painful to do interactive analysis with, because when you go down a secondary line you almost instantly lose what you had stored on the main line. Until R4 came out, I assumed you were going to address this with a more advanced persistent hash. Now that the persistent hash is gone, the alternatives are the frustration of analyzing with Rybka versus the more pleasant experience of analyzing with another, weaker engine.


This seems to be some sort of a special problem, I need to dig into it.

Vas
Parent - - By Akorps (**) [us] Date 2010-06-07 14:27
Pardon me for asking a stupid question, but a friend asked me if Rybka's favorite move when analyzing 6 lines in multi-variation mode would be the same (at the same ply depth) as in single-variation mode. I said not necessarily, but I thought it might be a good idea to ask in the forum to make sure.
Parent - By Vasik Rajlich (Silver) [pl] Date 2010-06-07 20:41
Not necessarily - there will be systematic differences.

Vas
Parent - - By Lukas Cimiotti (Bronze) [de] Date 2010-06-06 14:00

>I'm not sure how far you are from the edge of the cliff at this point.


This heavily depends on the Rybka version and on hardware. Rybka 2.3.2a obviously made a lot more memory accesses than Rybka 3.
You could encounter this problem running Rybka 2.3.2a on a Skulltrail, which has good integer performance but lousy memory speed, despite Intel claiming it was good. Going from 2.8 to 4 GHz should add 42% of performance if memory were not an issue. IIRC the speedup was less than 30%. OK, this might also be caused by inter-CPU communication via the FSB. The problem got better with Rybka 3 and much better with Nehalem, which has much better memory throughput and lower latencies.
But if you look at the speedup you gain from using large pages, you know that we are not that far away from the edge of the cliff.
Parent - - By Banned for Life (Gold) Date 2010-06-06 17:37
But if you look at the speedup you gain from using large pages, you know that we are not that far away from the edge of the cliff.

When you use small pages (4 KB) with a large hash and you look at the number of records in the TLB, you see that you will need a new TLB record for each hash access, i.e. every hash access will take 2 memory accesses rather than 1. Huge pages (sorry, I can't force myself to use the Microsoft lingo of large pages) get past this problem by letting you make the pages large enough that the memory for your hash table takes up only a portion of your TLB records (and these never need to be changed). So in effect the huge pages are cutting your memory accesses in half, and even more importantly, cutting your latency in half (or even more, because the TLB has to do a lot more than just fetch a new record when it needs to access a location that has no virtual-to-physical translation record in storage). The question is: does reducing memory latency by 50% or more equate to a 10-15% performance improvement for a chess engine? I'm guessing it does, because you see this same ballpark improvement with other engines when you go from small pages to huge pages.
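
(For anyone curious, asking Windows for the hash table in large pages looks roughly like the sketch below; a generic pattern that needs the "Lock pages in memory" privilege, not Rybka's actual allocation code.)

    #include <windows.h>
    #include <cstddef>

    // Roughly how an engine asks Windows for its hash table in large pages, so
    // that a handful of TLB entries cover the whole table.  A generic sketch,
    // not Rybka's actual allocation code.
    void* alloc_hash(std::size_t bytes) {
        std::size_t large = GetLargePageMinimum();            // typically 2 MB on x64
        if (large != 0) {
            std::size_t rounded = (bytes + large - 1) & ~(large - 1);
            void* p = VirtualAlloc(nullptr, rounded,
                                   MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                                   PAGE_READWRITE);
            if (p) return p;                                   // got large pages
        }
        // Fall back to ordinary 4 KB pages if the privilege or memory isn't there.
        return VirtualAlloc(nullptr, bytes, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    }

(Memory allocated with MEM_LARGE_PAGES is also locked and non-pageable, which is part of why the extra privilege is required.)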

Anyway, look at the bright side. If number of cores per socket, sockets per motherboard, and core speed increase to the point where memory bandwidth becomes a problem, Vas has left himself plenty of room to improve efficiency in this area!
Parent - By Vasik Rajlich (Silver) [pl] Date 2010-06-07 20:44

> Anyway, look at the bright side. If number of cores per socket, sockets per motherboard, and core speed increase to the point where memory bandwidth becomes a problem, Vas has left himself plenty of room to improve efficiency in this area!


Not for the reason you think.

I could easily probe less without any major performance loss.

The performance of the hash table can't be substantially improved, I am sure about it.

Vas
Parent - By keoki010 (Silver) [us] Date 2010-06-02 18:05

> I tried to go through all the steps, but after 5 minutes I was still on depth 17.  I guess this is one of the positions where the branching simply blows up into all sorts of lines.  Even if I would follow the steps mentioned and would be able to reproduce the bug, I really think you can't call that a blanket failure of MPV.  A more accurate description would be that in some special cases MPV doesn't work like you would like it to.  I use MPV analysis always and used it specifically when watching the Cluster games in Leiden.  Had absolutely no problems.  Maybe in some positions there is a problem or maybe when some certain steps are taken a bug can be produced, but MPV certainly works.  By the way this sounds more like the hash simply gets filled and you need to flush it.  I mean you are filling the hash tables with single PV data and then all of a sudden you switch to MPV after a lot of the tables are already filled with data from the single PV ... so what does the engine do?  I guess if you want it to discard all the single PV data and start a brand new MPV data set from scratch then you should clear the hash.  Does MPV also not work if you clear hash in that situation?  Is this bug also there if you are using a lot of RAM on your system and if you use larger hash settings?


M Ansari, check my post for the position in the Aquarium forum. I started that position in MultiPV and it not only hangs with R3 and R4, but the engine itself hangs and you have to exit Aquarium to get another engine into the analysis.
Parent - By Uly (Gold) [mx] Date 2010-06-01 07:47
And this:

http://rybkaforum.net/cgi-bin/rybkaforum/topic_show.pl?tid=17152

(Many people reproducing the MultiPV bug)
Parent - - By Bobby C (****) Date 2010-06-01 07:57 Edited 2010-06-01 08:08

> Multi-pv works very well in Rybka 4.


Maybe your Rybka. So you claim I don't know how to use a chess engine? You obviously haven't read the threads where only about a dozen or so people have reproduced this glitch/bug/show stopper. I am guessing you don't even use the same Rybka as the rest of us; you beta tested it, but you have the cluster, and I am sure that if any bugs arise with it, Vas hops on it and fixes them.
Parent - By Geomusic (*****) Date 2010-06-01 08:05
really? lol...
Parent - By bnc (***) Date 2010-06-01 12:12
Bobby C is not the only user who has seen this problem with multipv analysis.
Parent - - By yanquis1972 (****) [us] Date 2010-06-02 15:13
Bobby, Vas himself (who seems far less disingenuous than a lot of those who defend him) said on this forum, in response to a question of mine, that the MPV improvement over Rybka 3 was minimal. In other words, he didn't spend much if any time working on it for 'us', only the cluster; i.e., you are correct.
Parent - By Bobby C (****) Date 2010-06-02 19:51
I know I am correct. One guy saying he won't buy another Rybka won't motivate Vas, but if the bugs in this version are not fixed, I will do my best to organize a boycott against Vas, Convekta, Chessbase or anyone who supports Vas and Rybka. I am seriously pissed off, and they think they have problems with some other guys recently...
Parent - - By Vasik Rajlich (Silver) [pl] Date 2010-06-04 15:29

> Anytime I try to forward analyze a position then return to the root and switch to multi-pv Rybka 4 freezes up.


Yes, I will check this.

Vas
Parent - - By Bobby C (****) Date 2010-06-04 19:07
Thank you, Vas. For a correspondence player who often switches from single to multi-PV during his analysis, this bug is very frustrating. If you manage to fix this, a great many of your customers would be very happy and appreciative.
Parent - - By turbojuice1122 (Gold) [us] Date 2010-06-04 20:08
+1 (both for post and customers :-) )
Parent - By keoki010 (Silver) [us] Date 2010-06-04 20:29
+2 :smile:
Parent - - By Vasik Rajlich (Silver) [pl] Date 2010-06-05 12:00
I reproduced it several times yesterday but will need to think about it.

Vas
Parent - By Bobby C (****) Date 2010-06-05 18:21
Thanks for looking into it. Rybka 4 is a good engine for engine matches and probably will win a lot of tournaments. For analysis it is very frustrating and I am guessing the majority of your customers don't buy Rybka 4 to play on playchess. I currently am using other engines for analysis but would love to be able to really use Rybka 4 because it is just flat out better without that bug.
Parent - - By pokerpawn (***) [be] Date 2010-06-02 01:37
The result is especially impressive when you see that Rybka didn't play the last 4, but played all the best programs ...

Parent - By Dragon Mist (****) [hr] Date 2010-06-02 23:11
Ah yes, that gives another picture altogether ... and to think that we lost to Sjeng because of a cluster malfunction ... superb, superb job, Rybka!
- - By mdt (**) [uy] Date 2010-06-21 00:02
What time control was used in the ICT?
Parent - By pokerpawn (***) [be] Date 2010-06-21 04:32
I think it was 90 minutes (1h30) per player per game.