> Re: Fritz uses more power
I can tell which engine is on move in testing simply by fan noise differences in many cases. [Dan Bernstein had a quote about his various super-optimised code once: "I like to hear the gears kick in as the fan goes faster and faster" :)]
By the way, if you just ask, Intel will ship you printed versions of their manuals. (There's a link next to the download link.) Free of charge, no questions asked. I just received a bunch of them -- when it's 2-3000 pages, it's just so much easier having it in paper form :-)
/* Steinar */
I find the manuals very readable, actually. We had some memory access problems in late 2006 or so with the then-new Intel machines, and one of those manuals with a huge title pretty much had everything I needed about the topic.
/* Steinar */
> By the way, the paper doesn't seem to mention what year the given two-fold speedup was
I think the time in quote is about 2002, when GMP 4.1 came out. [I can ask Zimmermann if you really want to know].
All the speed in GMP is still at the assembler level. You can build the "generic" code (no assembler support), and I think you'll get something about 4 times slower. [Unfortunately, I'm having problems testing this]. Indeed, Granlund has been trying to get "vendor sponsorship" from AMD (and others) to include his assembly level optimisations in the GMP distro - a fork has been resisted, but most people use, say, the patches of Gaudry, which, while maybe 10% slower than Granlund's (non-public) code, are still nearly twice as fast as the general ix86 code when used at the bottom level of GMP. [As an example, see the "Optimal" column versus "GMPbench" in the table here].
BTR is almost certainly a loss. Since at least the Pentium Pro, complex instructions are more expensive than short series of simple instructions. A decrement and an AND is absurdly cheap; I couldn't find cycle counts for modern CPUs offhand, but the Pentium needed at least seven cycles for BTR, and I doubt it's gotten much cheaper.
/* Steinar */
>I'm not all that sure your first optimization is a win
As I say, the proof would be in the pudding...
> CPUs do _much_ more intelligent branch prediction than just simple yes/no
No matter how intelligent they are [I might agree with much, but not with its underlining] it cannot be a bad idea to reduce randomness in the input - the only question is whether code sprawl would become a worry.
> and the fact that the last branch is almost never taken
You could just simply ignore the
whileloop over the 3rd and higher bishops - in almost all positions this will not matter. My impression is that a 100% predicted branch should be nearly a
NOP, at least when the condition is simple as here - on the other hand, avoidance of it would also allow a gain from avoiding the preceding
BTR. There is also the issue that (say) the PHENOM has counters for only 3 branches per (aligned) 16-byte block.
> the Pentium needed at least seven cycles for
BTRis 2 cycles on the PHENOM (assuming
maskis in a register) - see Table 13 in Appendix C here. Also,
ANDhas pipeline dependencies, in that you are
ANDing with the result of the
DEC[my impression is that this doesn't matter, but getting AMD and especially Intel to be specific about the vagaries of micro-op decoding is a bit difficult]. Both
ADDshould be 2 bytes in opcodes, though I think DEC might itself become two bytes in 64-bit mode (here is the AMD info; see Table A-2 on the 48-4F row for
DEC, with footnote 5 giving the info about REX in 64-bit mode, so the
DECbecomes FF/1 with the ModRM extension). Finally,
ANDuses 2 registers instead of one.
Seriously, there's a reason why we don't use LOOP anymore, but instead DEC+JNZ. It's the same thing.
/* Steinar */
I guess we could recompile Strelka and measure, but I'm not going to do that anytime soon. :-P
/* Steinar */
> so you'd need to keep the result of the BSR around for a lot longer
I'm guessing (weasel-word) that having the
squarearound should not be too problematic, as the instruction can go anywhere in the code block. Indeed, the bitboard operations in computing (say) mobility will certainly use
squarea couple of times, and if you have any more code, there could be additional opportunities. I guess you are correct that my chastening of
ANDfor needing two registers is a bit harsh, but that's because I was thinking that the
BSR-result was readily available, and thus came for free.
> But the DEC and AND can go in parallel with other stuff.
This is not untrue, but is there a relative advantage in parallel capability with
BTR? I could be missing something, but I don't see it. [You have the whole WH_BISHOP_CODE block in which to schedule].
> we don't use LOOP anymore
Was it ever a good idea? :-P
/* Steinar */
The comparison you make with Fritz 5 (which I translate to Fritz 5.32) is interesting. It has been said by many that Strelka "unlocks" the nps the Rybka squelches in the output. However, even so, Strelka still only calculates about half the nodes per second as Fritz 5.32 (and perhaps 25% or less than Goliath Light 1.5). Does Strelka still somehow "internally" suppress the count by a factor of 2? Or are you referring to something slightly different in the "bean-counting analysis"?
Good luck in computer science! I'm in physics, trying to do the same thing...
> Consider Fruit: you can take the new ideas from Fruit, add them to an existing chess program, and make it better.
What was particularly new in Fruit? The use of History vis-a-vis LateMoveReductions was a bit novel, I guess, but it need not be much of a strength gain. [LMR was already bandied about by Romstad and Markoff a year before Fruit appeared]. VR's comment about FL presents a Fruit-opinion similar to my own: "Fabien is a very good engineer, and also has a very clear and simple conception of how his search should behave." Or we can hear from FL himself: "I can't think of a search feature in it that was not described before. Ditto for evaluation terms (except perhaps a few drawish-endgame rules that activate in one game in a hundred). There are specific principles that I follow in Fruit that gives it a personality somewhat (like never truncating the PV and making sure that mate-depth claims are always correct), but they probably have no impact on strength at all and could even hurt a little."
> It packs most of the search and evaluation of Fruit/Crafty into something almost 3 times faster.
64-bit Strelka is about 3 times as fast as Fruit in NPS --- as Hyatt said back in 1994, eventually bitboards would gain the speed advantage. Also, FL never tried to optimise Fruit per se: "More concretely, in chess [clearer, uncomplicated code] leads to slightly-slower programs (say 25% lower NPS) that are easier to modify. For example Fruit spends a lot of time scanning board squares to evaluate mobility. I know several faster ways I could compute the same thing. However none of them would allow me to easily modify the mobility feature later."
> I did not add any eval terms from Strelka (my view being that Zappa's eval is way better than Strelka's anyway)
Strelka's eval seems quite minimal, largely mobility.
I will also have to assume that you haven't done much programming in C, because you are really underestimating just how great of an optimization job is in Rybka. Look at Fabien's own statement: He thought that Fruit could be optimized by 25%, not 200%.
I believe that many programmers invented it independently.
Movei0.07.99 is using LMR based on my memory(I will need to look at my old code if I saved it to be sure but I am sure I invented it some years ago independently of other people and maybe stefan made it earlier).
Based on my memory I invented it some time between 0.07a and 0.07.99
It is weak because of other reasons but LMR certainly made it stronger.
> I do not think that there is one true inventor of LMR
Indeed, it's just another one of those forward pruning techniques that the commercial guys are using. :)
> but LMR certainly made it stronger
Now we know why it won New-Year-Fun '03. :-P
> I've just realised what this forum needs: a glossary of terms.
I keep saying this Werewolf, no one bothers, sob, sob. :)
> because you are really underestimating just how great of an optimization job is in Rybka.
I am definitely impressed by the optimising in Strelka (it is about 50% faster than my first bitboard engine). But I do not see from what you can claim a 3x engineering gain, unless you compare apples to oranges vis-a-vis bitboards. On a 32-bit computer, Strelka is something like 50% faster (in NPS) than Fruit --- unsurprisingly, this becomes almost 3x at 64 bits. I do not find Strelka to be 3x faster than bitboard engines like Crafty and Glaurung.
> S7-S9 completely dominated the computer chess world.
I'm not sure I quite agree with this description of history. Also, SM-K claims that a big push came in Shredder 8 (no idea about 7.04, which was noticeably better than 7): "Since Shredder 8 there was a completely new search in Shredder, which is extremely selective. As a result Shredder 8 always calculates a few halfmoves deeper than its predecessor. Of course the difference in the search depths of both versions are not simply comparable with each other, however I believe that the new search is much better and there is also clearly more room for further improvements."
>[ LMR] is why S7-S9
Is this an educated guess, or do you have Inside Info? Without any other information, one could speculate that his "forward pruning methods" might have been closer to that, say, of Björnsson and Marsland, for instance.
> because [Fabien] published his work
"For Fruit, which ... made even the densest of us aware of fail-low pruning." :)
In my view, Tord clearly deserves the main credit for LMR. Fabien probably deserves the #2 spot, since Fruit is what really showed beyond any doubt that this works.
> [Shredder 7 already used LMR]
Which leaves us guessing how he gained the ELO in the next two versions. :)
[I think he said Shredderbases and eval changes with passed pawns and king safety for S10/S11, though I don't think bitbases are the whole story for S10 - sorry can't fact-check at this terminal].
> [Fruit showed LMR works]
I'd be a bit more circumspect: Fruit showed it could work - if you already had (say) a whole bunch of other reduction methods, the gain might be small. It also showed that "a simple approach works well" when considering reductions.
I do not care if it was used by other programs in the past because I did not know it.
I also do not agree about definition of new.
If a chess player plays a strong move that was played 100 years ago and was evaluated wrongly then claiming that the player found nothing new is wrong.
The player found that the move that was played 100 years ago is a strong move and the same is about fruit.
Finding that it is good to use some combination of ideas that were suggested in the past is not finding nothing.
It is not something obvious otherwise other programmers could also do it and it is a fact that the commercial programmers at the time of fruit2.1 did not do it(otherwise they could easily be above the level of fruit together with their own ideas.
However I have two requests for you. Please continue your endevour also with Zappa along with your very successful PhD and please visit our forum more often for your valuable insights. For a small community of chess fans (other than the elites) this is the only motivation for us.
Second, even though I've spent less time in academia than you (master's degree vs. graduate student), I'm not so sure novelty rules all. If you cannot show that your novelty has at least a certain advantage in some situations over existing practice, it's not going to be that interesting.
/* Steinar */
Academic Science is about inncocent curiosity (innocent in terms of not thinking about profit or usefulness but only about the beauty of thought). This is the difference to industrial research.
What about chess? Is it useful? Does it improve mankinds situation? Does it help to create a better world? Of course, one can use some far-fetched arguments like "trains your mind" and so on, but there are many chess lovers who do this mind-sport due to the intellectual challenge, the search for amazing ideas, for joy without thinking about 'usefulness'.
I cant blame A. Cozzie for his point of view. Even though he seems not to care too much about Zappa (this might be vanity), he created one of the strongest engines. I think this competition stimulates the development of chess engines further. Maybe Rybka 3 would not be as strong as it is according to L. Kaufmann if the defeat in Mexico would not have taken place.
Also I am not so sure that all there is with Rybka is fast search. Obviously much more is there. If Rybka gets a 3X speed advantage that would mean that Rybka on 3X slower hardware handicap should be performing equal to Crafty or Fruit. It would be interesting to put this to a test .... Rybka on say an Athlon 1.2Ghz against a Quadcore 3.2Ghz Crafty. That would probably be around a 10X speed difference ... yet I think Rybka would easily come out on top. I have a lot of respect for Anthony but I think he still hasn't seen all there is to see in Rybka. The evaluation function in Zappa is great ... and in some cases superior to Rybka ... but in the majority of the games I see, Rybka has much better evaluation. I think people way underrate the value of evaluation tables that are in Rybka ... these have been honed down with thousands of games ... they have been tested and slowly improved and seem to add tremendous knowledge to Rybka. They are not perfect and probably will never be ... but they definetely are right more times than any other program. So a combination of faster search with superior evaluation is what makes Rybka strong.
Rybka can still improve quite a bit I think. Multiprocessor efficiency is still a work in progress and can only improve. Also a better implementation of endgame tablebases (such as implementing bitbases) would allow Rybka to have all 6 piece EGTB's (and some 7) without any drop off in nps. As more and more cores are added to CPU's ... being able to efficiently make use of these cores will be paramount to progress. IMHO the way to do that is to develop the engine in a way where you split the engine into seperate independent modules on independent motherboards ... each module would be extremely good at a specific aspect of the game ... opening ... middle game ... tactics ... positional accuracy ... and ofcourse endgame play. Then you would have communication between these modules and an algorithim that would change dynamically during the course of the game that would set a system to what move to choose and from which module. While communicating between these seperate modules would probably have a large latency hit ... if it works and I am sure it would work ... a quick adaptation of these cores on a single machine with many cores would immediately remove the latency hit and thus give you a performance boost.
Where are all the male programs?
Public Movei is losing against rybka32 bit in these conditions if you use slow time control.
Note that Movei can beat Rybka 32 bit with 25:1 time advantage but maybe the time control that I used was not slow enough for rybka(I did not work on movei in the last months and I prefer to spend my time in understanding strelka before rewriting movei).
If we talk about evaluation then my opinion is that in the majority of the game you cannot know if rybka has better evaluation or Zappa has better evaluation.
The only case when it is clear that program A has better evaluation is if there is a big difference in the evaluation for some moves and it is clear that one of them is right.
In most games it simply does not happen so you cannot say based on watching the games which program has better evaluation.
If program slightly improves her position then it may be because of better evaluation or because of deeper search and you have no way to know based on watching the game.
> I am sure that Crafty is going to lose with 10:1 time handicap against rybka.
I agree completely. Back in the late 90s, Rebel 8.0 gave Crafty 11.17 a 50:1 (or 100:1) handicap in the "NPS Challenge" - the match was aborted, but Rebel won the only game that was completed. I would opine that one game extant (with the analysis) gives a reasonable idea of why Crafty won't win with these time odds.
1)I doubt if rebel can continue in that way and one game proved nothing.
2)Crafty improved from that time.
I also think that Rybka can give 10:1 time handicap to rebel or prodeo and win.
Movei is slightly stronger than prodeo based on rating lists and if rybka can do it to movei it probably can do it to prodeo.
> This single game is a bad example because of the following:
I agree that the game result is a bit useless. However, the engine analysis can also be taken into consideration, for instance, how does Crafty's eval/PV change as depth increases [I thought the Crafty log was available, but I guess not - in fact, I think I was remembering a different 100:1 game (aborted at move 39), Crafty 11.19 versus Hiarcs 6 on rec.games.chess.computer, where the Crafty logs are extant]. Also, I was not trying to predict Crafty-Rybka at 10:1 per se, but rather to demonstrate that large time odds are not insuperable.
Powered by mwForum 2.27.4 © 1999-2012 Markus Wichitill