Rybka Chess Community Forum
Topic: The Rybka Lounge / Computer Chess / MCTS weakness wrt AB (via Daniel Shawul)
- - By Chris Whittington (**) [gb] Date 2017-12-25 13:13
"Weakness" of MCTS wrt AB (paper2012)

https://www.aaai.org/ocs/index.php/ICAPS/ICAPS10/paper/download/1458/1571

am pondering on this surrounded by Christmas festivities ... think I have worked out how AZ gets around this real problem and actually exploits it. will expound later .....
Parent - By Banned for Life (Gold) Date 2017-12-25 17:57
You seem like you would be a much better candidate for Festivus, especially the Airing of Grievances! :lol:
Parent - By Banned for Life (Gold) Date 2017-12-25 18:00
Thanks for the link. Very interesting paper. It's actually from 2010 if we're quibbling! :lol:
Parent - By Banned for Life (Gold) Date 2017-12-25 18:12
This is a link to another paper by the same author from 2011 that addresses some other issues...

https://aaai.org/ocs/index.php/ICAPS/ICAPS11/paper/download/2708/3154
Parent - By Banned for Life (Gold) Date 2017-12-25 18:15
Finally, a more comprehensive paper by the same authors, also from 2011:

https://arxiv.org/pdf/1203.4011
Parent - - By Chris Whittington (**) [gb] Date 2017-12-25 19:22
Daniel Shawul wrote: "UCB will visit nodes that are proven statistically inferior with confidence bounds less and less (an exponentially decaying visit rate, maybe). The effort spent on weak-looking moves is really small (especially when you use a small exploration coefficient), which is good for games like Go, but for chess it can miss shallow traps. Using a larger exploration coefficient will pull the algorithm towards minimax, which can solve those traps, but then it will have a huge branching factor. Unless there is an algorithmic breakthrough, the only way to deal with this problem is to avoid getting into this kind of situation. This is one of the things I want explained, besides the strength of the pure policy network player."
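(For reference, a minimal sketch of the UCB1 rule being discussed — the node fields are illustrative, and C is the exploration coefficient Daniel mentions:)

    import math

    def ucb1_select(node, C=1.4):
        # UCB1: exploit the observed win rate, plus an exploration bonus that
        # shrinks as a child soaks up visits. A small C starves statistically
        # inferior moves of visits; a large C pushes the search towards minimax.
        total = sum(child.visits for child in node.children)
        return max(node.children,
                   key=lambda child: child.wins / child.visits
                       + C * math.sqrt(math.log(total) / child.visits))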

Yes, exactly. "The way to deal with this problem is to avoid getting into these kinds of situations".

Well, it's the combination of the trained ANN with the MCTS in gameplay, using depth-limited search with the ANN as leaf evaluator. The ANN evaluation is basically the win rate from that position (I think). It's maximising on having more rollout wins from a position than losses.

Let's go with this "trap" terminology. Generally, if we fall into a trap we lose, if the opponent falls into a trap he loses. So, we can argue that the ANN output is also a function(traps further down the tree where we win / traps further down the tree where we lose).

So, Daniel got it. "The way to deal with this problem is to avoid getting into these kinds of situations", although he didn't state the complement, namely: "to increase the chances of the opponent getting into these kinds of situations".

Precisely what the AZ ANN is trained to do: play moves that increase OUR win rate down the tree and increase the OPPONENT's loss rate down the tree. Or, in "trap" terminology, play moves that increase the probability of the OPPONENT falling into traps and decrease the probability of US falling into traps.

So, what exactly is the AZ ANN trained to do? Play towards those parts of the tree where we have shots at winning and the opponent doesn't. AZ wants chaotic regions of tree with high self win rate and low self loss rate. And, that's where the ANN takes it.

Ok, what is important to realise is that AZ ANN is NOT finding actual forced wins. It's finding almost wins. Positions where the rollouts found on average more wins than losses. Where, but for some defence, AZ wins. In other words it finds THREATS. Potential threats. For example where if opponent moves his Knight, he gets mated. The regional search of MCTS likes this. AZ is maximising on multiple threats, multiple ways to win tree regions. Whilst denying these to opponent.

Imagine a position where the opponent has two moves, X and Y. Opponent plays X, all is ok. Opponent plays Y and he is mated.
AB program says this position is ok, 0.5.
MCTS says this position is the average of ok and a win, 0.75. It's a regional look at the tree: play towards regions where our side has threats. AB has absolutely no idea at all about the search region; all it cares about is the one main line.
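(A back-of-envelope version, assuming uniform visits and plain mean backup on the MCTS side:)

    # Opponent to move: reply X holds the balance, reply Y walks into mate.
    replies = {"X": 0.5, "Y": 1.0}   # scores from our point of view

    # AB backs up the minimax value: the opponent plays his best reply.
    ab_value = min(replies.values())                    # 0.5 -- "position is ok"

    # MCTS with mean backup averages over the region instead.
    mcts_value = sum(replies.values()) / len(replies)   # 0.75 -- "ok plus a threat"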

As pointed out, the strength of AB is ability to pick out a trap line beyond opponent horizon. The strength of MCTS with ANN is to choose to take AB into unbalanced regions where MCTS ANN has lots of possible traps and AB has few.
Parent - - By h.g.muller (****) [nl] Date 2017-12-26 09:15
The policy output of the NN is separate from the evaluation output. So the NN is not limited to recommending only good moves for search. It starts with no knowledge and a large branching factor. Under these conditions it will recognize the traps, but, as Daniel says, a lot of visits will go into this to discover the move is poisoned. So the NN will be trained to recognize possible traps, and to allocate a large number of visits to them. This prevents their being masked by clear-cut moves that on average have a higher score (but often not the highest).
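(Concretely, the AlphaZero preprint describes a PUCT-style selection rule along these lines — a sketch, with illustrative field names. The prior P from the policy head keeps feeding visits to a move the NN has flagged, even while its average score Q is still poor:)

    import math

    def puct_select(node, c_puct=1.0):
        # q_value is the running average score; prior is the NN policy output.
        # The bonus stays large for moves with a high prior and few visits,
        # so a recognized trap candidate keeps getting searched.
        sqrt_total = math.sqrt(sum(child.visits for child in node.children))
        return max(node.children,
                   key=lambda child: child.q_value
                       + c_puct * child.prior * sqrt_total / (1 + child.visits))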

The problems arise when you use score to guide reduction. Risky moves will on average have a lower score, but will often turn out to have the highest score, and impeding their search because of the low average score will blind you to recognizing when they are good.
Parent - - By Chris Whittington (**) [gb] Date 2017-12-26 20:56
I'm assuming the network value is trained on game result, and generally will settle, in a well-trained NN, to the win rate.
I also assume the policy network is trained on visit count or popularity. Popularity in a well-trained network with sufficient rollout games maps to win rate, I think.

Posit a node with two possible moves, A and B. A is an ok move; B looks ok except there's a trap line (i.e. objectively it's not ok).
During training, until the trap becomes apparent, both A and B will have similar win rates and similar popularity counts. When the training comes across the trap, the win rate of B will fall and the selection of A will tend to increase. The eventually well-trained NN will have learnt that A's value is better than B's, and the policy network will have learnt that A is more popular than B.
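(If the policy target is, as I assume, the normalized root visit count — the AlphaGo Zero paper uses N^(1/tau) with a temperature tau — the divergence shows up directly in the training signal:)

    def policy_target(visit_counts, tau=1.0):
        # Normalized visit counts as the policy training target.
        powered = {m: n ** (1.0 / tau) for m, n in visit_counts.items()}
        total = sum(powered.values())
        return {m: p / total for m, p in powered.items()}

    # Before the trap is found, A and B look alike; after, B's visits collapse:
    policy_target({"A": 400, "B": 380})   # ~ {A: 0.51, B: 0.49}
    policy_target({"A": 700, "B": 80})    # ~ {A: 0.90, B: 0.10}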

We don't know how AZ as a playing engine runs its MCTS search. I am guessing, but I assume it is depth-limited, returning a value based on win rate to act as the "rollout result". And I also guess that the policy network orders moves on the basis of the learnt popularity count.

If the finished ANN were a perfect 32-man EGTB, then the initial rollout would be the main line. The value of its ply-one move would be exact, the response to it the best available, and so on.
The AZ ANN isn't perfect, but a good enough ANN is mostly going to generate sensible lines (I assume); the leaf-node values will be imperfect but should average back, smoothing out the noise. And the move-ordering policy ought to select moves in a sensible order. And, because the ANN is maximising on playout win rate, the value and policy networks will bias the search engine towards unbalanced, high-self-win parts of the tree (traps low for us, traps high for the opponent), where our side has threats/chances and the opponent doesn't. Hence the Tal-ish play style.
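(Putting those guesses together, one simulation of the playing-phase search might look like this — the NN value standing in for the rollout result, the policy priors doing the move ordering; node and net fields are illustrative:)

    import math

    def simulate(root, net):
        path, node = [root], root
        while node.children:                              # select down the tree
            sqrt_total = math.sqrt(sum(c.visits for c in node.children))
            node = max(node.children,                     # prior-guided UCT
                       key=lambda c: c.q_value
                           + c.prior * sqrt_total / (1 + c.visits))
            path.append(node)
        value, priors = net.evaluate(node.position)       # ANN as "rollout result"
        node.expand(priors)                               # children seeded with policy
        for n in reversed(path):                          # back the win rate up
            n.visits += 1
            n.q_value += (value - n.q_value) / n.visits
            value = 1.0 - value                           # flip side to move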

So, I see the role of the player-phase search as a means of dealing with the noisy evals and imperfect policy of the imperfect ANN. Most of the solution to the "trap-finding" problem has already been encoded into the ANN.
Parent - - By h.g.muller (****) [nl] Date 2017-12-27 09:44
The case you describe is not the one that causes MCTS a hard time. It is the reverse, where a move superficially looks bad and can only be recognized as good after a large number of visits, while its initial qualification as bad strongly discourages those visits. This can be remedied by recognizing (through the NN) moves that on average are bad but have the potential to become good, and occasionally do so, and then assigning priority to their search: first making sure they are indeed bad, so that searching the alternatives makes sense.

The NN should be able to learn that when the training uses deep enough search to eventually see the move is good, because from that point on most of the plays would go into that move. And initially, in the random phase, it will not be discouraged yet from investigating the move, because it cannot recognize the superficial badness yet. So the NN learns to look beyond superficial badness from the start, if that is within its capabilities.
Parent - - By Chris Whittington (**) [gb] Date 2017-12-27 11:10
I'm getting the impression you're talking about and interested in an ANN algorithm improved over AZ. I'm just trying to understand AZ at the moment. ok, no matter.

yes, if a move from some node in the tree looks initially bad, the UCT algo will tend to reduce visits to it. However, this is compensated to some extent because the parent move will look good, and the UCT will tend to increase playouts of the parent.

what is basically being described here is indeed the trade-off between AB and MCTS. AB is good at picking out single lines. MCTS (especially with an ANN) is good at steering towards chaos regions where the engine has many win possibilities and the opponent doesn't.

question is: can MCTS steer AB into one-sided chaos and eventually overload AB to the point of collapse, or can AB find the single path(s) out the other side?

you may find, btw, that some MCTS implementations select the gameplay root move on popularity and some on win rate. ideally these both point to the same move. if they don't, it's a sign something is going on. it would be difficult, and need many rollouts, but a comparison of the rate of change of one wrt the other might cunningly be added to UCT.
in gameplay mode using the ANN, rollout win-rate disparities with the ANN's predictions would also be a flag.
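(The flag is cheap to compute if the tree exposes per-move visits and win rates — illustrative field names again:)

    def root_disagreement(root):
        # Most-visited move vs. best-win-rate move; if they differ,
        # something is going on at the root.
        by_visits = max(root.children, key=lambda c: c.visits)
        by_winrate = max(root.children, key=lambda c: c.wins / max(c.visits, 1))
        return by_visits.move, by_winrate.move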
Parent - By Chris Whittington (**) [fr] Date 2017-12-27 13:30 Edited 2017-12-27 20:35
Add to UCT (= win rate + exploration function) a Surprise function.

Keep another value in the tree: a 64-bit integer containing the last 32 rollout results, with win/loss/draw encoded in 2 bits each. Use the most recent N < 32 game results to produce a moving average. Surprise function = Constant * delta(win rate, moving average of win rate). Exploration is now increased a little, depending on the Constant's value. In your case (move thought to be bad, found to be good a bit later), we are now promoting node expansion.
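(A sketch of the bookkeeping, with my own choices for the 2-bit encoding and the constants:)

    import math

    SCORE = {0: 0.0, 1: 0.5, 2: 1.0}   # 2 bits per result: 0 loss, 1 draw, 2 win

    def push_result(history, result):
        # Shift the newest result into the 64-bit, 32-entry history word.
        return ((history << 2) | result) & 0xFFFFFFFFFFFFFFFF

    def recent_win_rate(history, n=8):
        # Moving average over the most recent n of the 32 stored results.
        return sum(SCORE[(history >> (2 * i)) & 0b11] for i in range(n)) / n

    def uct_with_surprise(node, log_parent_visits, C=1.4, K=0.5):
        # UCT = win rate + exploration + surprise. The surprise term grows when
        # recent results diverge from the long-run win rate, promoting expansion
        # of a node whose value is on the move.
        win_rate = node.wins / node.visits
        explore = C * math.sqrt(log_parent_visits / node.visits)
        return win_rate + explore + K * abs(recent_win_rate(node.history) - win_rate)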
Parent - - By h.g.muller (****) [nl] Date 2017-12-27 18:28
The parent doesn't necessarily look good when a move in the current node superficially looks bad. There could be other moves that score neutral.
Parent - - By Chris Whittington (**) [fr] Date 2017-12-27 19:05 Edited 2017-12-27 20:38
ok, switching to pedantic levels of accurate word use, I alter:

from: "However, this is compensated to some extent because the parent move will look good, and the UCT will tend to increase playouts of the parent."

to: "However, this is compensated to some extent because the parent move will look better than it did before, and the UCT will tend to increase playouts of the parent."

all other things being equal and so on and so forth.

Good god man, I give you descriptions of beauty and clarity, and a proposed modification of the UCT algorithm of genius and originality; and you return me a grunt!
Parent - - By udokaiser (**) Date 2018-01-07 10:03
Guys, you are very entertaining!
Parent - By RFK (Gold) Date 2018-01-10 07:35
Both these programmers have rather eccentric personalities; on top of that, most of what they are discussing re: programming holds little to no interest for me and/or goes right over my head! With that said, I have no idea who has a corner on substance and who is arguing problematic hypothetical potentialities.

However, I do enjoy the products that both these gentlemen have produced. I thoroughly enjoy playing against Chess Tal and using the WinBoard chess variants!