Columbia University

NY, NY 10027

mk463@columbia.edu

Finally, we evaluate our GP players by the percentage of games they win against random players. It would be useful to extract more information from this evaluation than the win percentage alone. Taking the analogy of chess: a grandmaster playing a novice should be able to win within 15 moves or so, and the speed or margin of victory therefore gives us another way to see how good our grandmaster is. We want to do the equivalent in Othello.

Playing against randomized players alone, or against Edgar alone, is lacking as training for our GP player. What we would like is a combination of Edgar and randomized opponents for our GPs to measure up against. Previously, each player in the population played 5 games against a random player; now, for each of the five games in the performance measure, we randomly choose whether the opponent is Edgar or the random player. Assuming Edgar plays much better than the random player, players that happen to face Edgar more often will tend to lose more and will probably receive a poorer performance measure than the others. This introduces a lot more noise into the fitness measure.
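The mixed evaluation described above can be sketched as follows. This is my reconstruction, not the original code: `play_game` is a stand-in for a full Othello game, and the player representation is hypothetical.

```python
import random

def play_game(player, opponent):
    """Stand-in for a full Othello game; here the stronger side always
    wins. Purely illustrative -- the real system plays actual games."""
    return player["strength"] > opponent["strength"]

def mixed_fitness(gp_player, edgar, random_player, n_games=5):
    """Each of the n_games picks its opponent at random, so a player
    unlucky enough to draw Edgar more often tends to score worse --
    the extra noise in the fitness measure discussed above."""
    losses = 0
    for _ in range(n_games):
        opponent = random.choice([edgar, random_player])
        if not play_game(gp_player, opponent):
            losses += 1
    return losses  # lower is better, GP-style standardized fitness
```

Because the opponent is re-drawn per game, two identical players can receive different fitness values depending purely on how many of their five games happened to be against Edgar.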

Since we would like more insight into how well our GP players do beyond the percentage of wins, we also evaluate by how much they win or lose to their opponents. This is a bit tricky, since we are not sure it is a good measure. For example, suppose the best strategy for winning Othello were to fall behind deliberately and then, in the final 6 moves or so, come out ahead with only a 5-point lead. Although the lead would be only five points, if this strategy consistently beat any player no matter how good, then using point spreads alongside win percentage would undervalue it relative to how good it really is. However, assuming this is not the case, and that the best strategies are those in which our player annihilates its opponent by a huge margin, then the players with the best point spread will indeed be the ones that consistently win the most games.

- Training Took Forever.
- Different Fitness Measures.

The addition of the primitives "black_long" and "white_long", which compute the longest chain of black or white pieces, added 256 additional steps to the UpdateBoard function (checking row-wise, column-wise, diagonal-left-wise, and diagonal-right-wise). Training took even longer for the last experiment, where we trained against Edgar, and since our training kept getting interrupted, we have uneven sets of experiment data: the most runs come from Part I (adding new primitives), the second most from Part 0 (Baseline, no additions; even though it took less time to compute, we have fewer GPs from it to test because it got interrupted too often), and the fewest from Part II (training against Edgar). Part III was arranged so that running the previous two parts would automatically report the evaluation in terms of total score as well as win percentage.
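A sketch of what a longest-chain primitive like black_long might compute; this is my reconstruction, not the original code. Scanning 4 directions from each of the 64 squares is one plausible accounting for the 256 extra steps mentioned above.

```python
def longest_chain(board, color):
    """Longest run of `color` pieces along any row, column, or diagonal
    of an 8x8 board, given as a list of 8 strings ('.'=empty, 'B'/'W').
    Illustrative reconstruction of the black_long/white_long primitives."""
    n = len(board)
    best = 0
    # the four scan directions: row-wise, column-wise, both diagonals
    directions = [(0, 1), (1, 0), (1, 1), (1, -1)]
    for r in range(n):
        for c in range(n):
            if board[r][c] != color:
                continue
            for dr, dc in directions:
                length = 1
                rr, cc = r + dr, c + dc
                while 0 <= rr < n and 0 <= cc < n and board[rr][cc] == color:
                    length += 1
                    rr, cc = rr + dr, cc + dc
                best = max(best, length)
    return best
```

Note this counts chains starting from every square, so a chain of length k is rescanned from each of its k squares; that redundancy is harmless for illustration but is part of why the primitive slowed training down.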

Since in Part II we trained against Edgar, and not just a random player, the GPs produced may have a higher (worse) fitness measure while in fact being better players overall. We have to rely on the final evaluation, where the players were tested in 50 games against Random and Randomized Edgar players.

| Run Number | Generation Number | Best Player in Run | GP Function | Win % vs. Random Players | Total Score %* vs. Random Players | Win % vs. Randomized Edgar | Total Score %* vs. Randomized Edgar |
|---|---|---|---|---|---|---|---|
| 1 | 2 | NO | - black_edges + 10 / black_near_corners black | 74% | 62% | 32% | 40% |
| 3 | 4 | YES | + * + black_near_corners white - black_corners 10 / / / white white_corners + black white_corners * 10 white_corners | 84% | 61% | 26% | 42% |
| 4 | 1 | NO | + * black_edges black_corners black | 80% | 63% | 16% | 37% |

\* Average percentage of the board occupied by black pieces at the end of the game.
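The Total Score Percentage column can be computed as follows; a minimal sketch, assuming each final board is given as 8 strings of 'B', 'W', and '.' characters.

```python
def total_score_percentage(final_boards):
    """Average fraction of the 64 squares held by black at game end,
    over a set of evaluation games -- the Total Score Percentage*
    metric. Each board is a list of 8 strings ('B', 'W', '.')."""
    fractions = []
    for board in final_boards:
        black = sum(row.count("B") for row in board)
        fractions.append(black / 64)
    return 100 * sum(fractions) / len(fractions)
```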

It is interesting that players with a higher win percentage against Edgar have a much lower win percentage against Random Players, and vice versa. What seems good against one seems bad against the other, or some players are simply overspecialized for beating Edgar. We do "okay" against Random Players, winning as much as 84% of the time. However well we do against Random Players, though, is roughly how well Edgar does against us: as we are to Random Players, so Edgar is to us. We win 74% to 84% of the time against Random Players; Edgar wins 68% to 84% of the time against us. The Total Score Percentages are similarly symmetric: against Random Players we hold about 60% of the points and win most games with that, and Edgar holds about 60% of the points against us and wins most games with that.

| Run Number | Generation Number | Best Player in Run | GP Function | Win % vs. Random Players | Total Score %* vs. Random Players | Win % vs. Randomized Edgar | Total Score %* vs. Randomized Edgar |
|---|---|---|---|---|---|---|---|
| 1 | 8 | NO | + / black white_corners white_near_edges | 66% | 58% | 18% | 40% |
| 2 | 4 | NO | / * black_edges black_long white_edges | 68% | 61% | 24% | 37% |
| 2 | 8 | NO | + black_long black_edges | 72% | 61% | 10% | 34% |

Frankly, the performance is lackluster. None of the GP players produced does anywhere near as well as the Baseline, which, as I stated earlier, is strange. And even though some of the players have interesting algorithms when we think about what strategy the GP player is following, none of them work that well. Especially strange is that we have a player (best in its generation, like all the players here) with a relatively high win percentage against random players that uses the new primitive black_long (Player 3 in the table), and yet it wins only 10% of its games against Edgar, which is appalling. Even its GP Total Score Percentage, which usually stays around 40% for the worst strategy in every experiment, here drops to 34%. And what is really weird is that I understand the strategy this GP suggests, and I think I use it when I play (which may explain why I lose), yet it does so badly. (The player is black_edges + black_long, which suggests that when trying to get edge pieces, it is advisable to clump the edges together; a very valid strategy, in my opinion.)
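The GP Function column is written in prefix notation, so "+ black_long black_edges" means black_long + black_edges. A tiny evaluator sketch makes the reading explicit; the feature values and the protected-division convention here are my assumptions, not taken from the original system.

```python
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b if b != 0 else a,  # protected division (assumed)
}

def eval_prefix(tokens, features):
    """Consume one prefix-notation GP expression from `tokens` and
    return its value; `features` maps primitive names (black,
    black_edges, black_long, ...) to board-feature values."""
    token = tokens.pop(0)
    if token in OPS:
        left = eval_prefix(tokens, features)
        right = eval_prefix(tokens, features)
        return OPS[token](left, right)
    try:
        return float(token)        # numeric constant such as 10
    except ValueError:
        return features[token]     # board feature such as black_edges
```

For example, "+ * black_edges black_corners black" reads as (black_edges * black_corners) + black.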

| Run Number | Generation Number | Best Player in Run | GP Function | Win % vs. Random Players | Total Score %* vs. Random Players | Win % vs. Randomized Edgar | Total Score %* vs. Randomized Edgar |
|---|---|---|---|---|---|---|---|
| 1 | 3 | NO | + + white_near_edges black black_corners | 90% | 66% | 34% | 44% |
| 1 | 7 | NO | + + white_near_edges black black | 74% | 61% | 16% | 39% |

Run 1 Generation 3 has the highest evaluation scores of any GP player produced, in every category: against Random Players and against Edgar, in Win Percentage and in Total Score Percentage. And it was generated in only three generations. In addition, Run 1 Generation 7 has very high test scores (at least compared to New Primitives). While training took longer, I believe the GPs selected were probably better players on the whole because of the experience of having trained against both expert and random players. The fact that we did not hit Terminal Fitness was expected, now that we were training against Edgar and were expecting to lose more often.

Click here to see my best GP (Variety Training, Run #1, Generation #3) in action against

Last modified Nov 18, 1997 by Monty Kahan mk463@columbia.edu