Machine Learning

Othello Project Report

Monty Kahan

Columbia University
NY, NY 10027


The Problem(s)

I have been given a Genetic Programming system with a small number of "Othello" primitives that can potentially produce a GP function tree that can play Othello well against a Random Player. The system uses a set of primitives that hopefully describes the Othello board in an accurate, and complete manner. The GP, based on the board state which described that set of primitives, then is asked to evaluate the n possible moves from that state. Since it would be difficult to describe every board state, our choice of primitives are guesses as to what set of primitives can still be small, and yet describe the board in such a manner that our GP player will be able to calculate what should be the next move. This set of primitives may be too small to do the job. Furthermore, as of present the GP system evaluates the players in its population by having them play against a random player. Now real life experience tells us that in order to improve at a game we should play a broad variety of players, who are all experts. Now we have been supplied with Edgar, who is an expert but playing against him alone would mean we would produce expert anti-Edgar Othello players, and missing the broad variety. Playing against random players alone, would be equivalent of playing chess against only novices, we would become slightly better but we would miss having the experience of grand masters, and seeing their strategies in action.

Finally, our final evaluation of our GP players is to see what percentage of games they win against random players. It would be interesting to see some more information in our evaluation than just the percentage of wins. Taking the analogy of chess, we know that grand masters playing against novice players should be able to win within 15 moves or so, and thus we have another method of evaluation which can let us see how good our grand master is. We want to do the equivalent in Othello.

The Solution(s)

Since the GP's search space is large and its learning method is very random, it would be useful to see if the addition of two primitives, chosen by me, would help the system along. Specifically, I added a primitive for how many pieces are near edges, and another primitive for the size of the longest consecutive row/column/diagonal on the board. Since this was done for both White and Black a total of four primitives were added to the system. Near edges, are like near corners, bad for the player, but not as bad as near corners. And it's not absolutely bad as near corners, from experience we know that a near edge piece can be useful. The second primitive works on the concept that if there are long strings of your opponent's pieces that means a good move would be to place your piece somewhere in the middle of that string and then it would be possible later to put your pieces on either side of the string to convert the entire string to your color. Placing your piece at the edge of the string would mean that in the next move your opponent can simply place his/her piece after you piece, making your move fruitless.

Both Playing against randomized players alone, or playing against Edgar alone are lacking as good training for our GP Player. What we like to do is use a combination of Edgar and randomized players for our GP's to measure up against. So while now each player in the population plays 5 games against a random player, we will now randomly choose who to play against (either Edgar or the Random player) for each of the five games for the performance measure. Assuming then that Edgar plays much better than the Random player; since some players playing more against Edgar will do worse, than those players will probably have a poorer performance measure than other players. This will introduce a lot more noise in the fitness measure.

Since we would like some more insight as to how well our GP players do, beyond the percentage of wins, we would like to evaluate how well they win or lose to their opponents. Now this is a bit tricky since we're not sure if this is indeed a good measure. For example, the best strategy of winning Othello would be to be to deliberately lose and then with the final 6 moves or so, come ahead with only a 5 point lead. While the lead would be only be five points, if this strategy could consistently beat any player no matter how good, then that strategy or algorithm would evaluate as well it really should by using point spreads as well as percentage of wins for our evaluation. However, assuming that this is not the case, that the best strategies are those in which our player annihilates its opponent by a huge amount, then indeed the players with the best point spread will be the one to consistently win the most games.


New Primitives

I. Near Edges

In order to give the GP system an extra boost two new primitives were added, in the hopes that an even more complete and detailed representation of the board would help the system create better Othello players. The first primitive represents near edges. was modified to calculate the num_white_near_edges and num_black_near_edges. Finally near_edges (black and white) was added to the node set of the GP system and evaluate functions for the GP Players. The hopes for this primitive was following the strategy that edges controlled the board, and thus the it was beneficial for your opponents to control the squares right near the edge, as then you had a chance of getting your pieces on the edge. Also this primitive works very with the near_corners primitive. In general experience says the most important pieces are the corners (or having at least two of them is very important.) So controlling near the corners is very bad. However experience also says that the diagonal "near corner" square is worse than the horizontal or vertical "near corner" squares. In combination with the primitive "near edge" which is also from experience a bad position to control, the square occupying the diagonal is "near corner" and "near edge" while the horizontal and vertical "near corners" are merely "near corner". Near edges is not always bad, sometimes experience shows its okay and even sometimes good to control, but we can let the GP system decide if using the primitive helps produce a better player.

II. Long Consecutive Chains

Once again I'll be using experience to create a primitive that will hopefully will be able to give the system a strategic piece of knowledge about the game board. Here I assumed that it was always best to try to cut long chains of your opponent's pieces. Also when determining how to control the edge, its always best to put the edges pieces together in a chain, rather than leaves holes in the middle since your opponent can then place a piece in that hole and when you reach the "near corner" position, you have effectively given your opponent the corner. However I decided for simplicity sake that we would only concern ourselves with the longest chain on the board for both white and black. So OthelloBoard now updates a num_white_long and num_black_long primitive by going trough the board column by column, row by row, diagonal by diagonal. (The diagonals are split into two, those that go from top right to bottom left and those that go from top left to bottom right). It finds the longest chain of each color. The primitive was then added to the node set and the evaluate functions in the GP system and GP player classes.

Variety Training

In order for the system to measure the performance of the players in the population with a wide variety of players, we want our GP players to play both random and Edgar players for our fitness measure. In order to do so, we have to remember that Edgar is deterministic and so is our GP player, so to we have to randomize Edgar a little. To do that we had the evalMove function return a random number for any GameTime less than sixteen, (meaning the first 8 moves of Edgar for any game would be random). We further decided that the choice of whether to play against Edgar or a Random Player was itself a random choice, (a number was chosen between 0 and 1, less than .5 meant Random, more meant Edgar.) This actually meant the fitness measure could be slightly skewed by players who played more Edgar games than Random games (assuming that Edgar was better than a random player.) However than meant the system may be a bit more random in selecting the best GP player for selection to the next generation. (Note: Originally a new EdgarPlayer object was created for every game but that means a file read for every object, and that resulted in the system getting too many file errors. Now all Edgar games are played with the same Edgar object.)

Evaluating Using Score Totals

We are also interested in seeing how well or badly our GP players win or lose respectively. Here we will modify OthelloCache to have our GP players play 50 games against Edgar (randomized of course since otherwise we could play 1 game) and fifty games against Random Players, outputting the total scores as well as how many they win or lose out of the total games played. We can see if the percentages match up to be roughly equal. If our GP is losing close to 100% or winning 100% of the time we could use the total score percentages to see if they're only just losing all the games or only just winning all the games by a hairbreadth. Or if the losses and wins have large point spreads. Using the Random players as well as Edgar for our evaluation will make sure that we haven't developed some special expert anti-Edgar player who will do poorly playing against anyone else.



The results are slightly varied and I feel the need to remind people of two issues they should keep in mind as they view these results.
  1. Training Took Forever.

  2. The addition of the primitive "black_long" and "white_long" which computed the longest chain of black or white pieces added 256 additional steps to the UpdateBoard function, (checking row-wise, column-wise, diagonal-left-wise, diagonal-right-wise.) Especially with the last experiment where we training against Edgar, training took even longer, and since our training kept getting interrupted, we have uneven sets of experiment data. The most from Part I adding new primitives, the second greatest from Part 0 (Baseline - no additions. The only reason we have less GP's from here to test out is that Part I even though it took less to compute was that it got interrupted too often.) And the least from Part II training against Edgar. Part III was made so that running the previous two parts would automatically give the evaluation in terms of total score as well as just the win percentage.
  3. Different Fitness Measures.

  4. Since in Part II we trained against Edgar, and not juts a random player, the GP's produced may have a higher (worse) fitness measure but indeed they are better players overall. We have to rely on the end - evaluation were the players were tested in 50 games against Random and Randomized Edgar players.


The Baseline had no additions. Four runs in total. All the generations of all the runs added up to 15. Because of the paucity of data I chose to select those best of generations which did well in training. From among these best of generations I selected those that did well against a set of test examples for presentation here. (a biased measure but one I had to resort to, becasue the runs took so long and very rarely ever beat the terminating fitness measure.) Now when we chose which GP to present here, we ignored all GP's which were obvious such as "black" or "-white", or anything that had "black" or "-white" as the only important terms, even if they were a best player in a run, e.g. Run #2's player. Out of the three presented here, two players stand out, as having especially good scores (relatively) against either Random Players or Edgar. (All percentages are done out of playing 50 games.)
Run Number Generation Number Best Player In Run GP Function Against Random Players Against Randomized Edgar Players
GP Win Percentage GP Total Score Percentage* GP Win Percentage GP Total Score Percentage*
1 2 NO - black_edges + 10 / black_near_corners black 74% 62% 32% 40%
3 4 YES + * + black_near_corners white - black_corners 10 / / / white white_corners + black white_corners * 10 white_corners  84% 61% 26% 42%
4 1 NO + * black_edges black_corners black  80% 63% 16% 37%

* - AVG percentage of board occupied by black pieces at end of game.

It is interesting to see that players which have a higher win percentage with Edgar have a much lower percentage with Random Players and vice versa. What seems good for one, seems bad for the other or, some are just overspecialized for beating Edgar. Now we do "okay" against Random Players. Winning sometimes 84% of the time. However well we do against Random Players, seems to be roughly the same as how we do against the big guns, Edgar. Meaning as we are to Random players, so is Edgar to us. We win 75% to 85% of the time against Random Players. Edgar wins 68% to 85% of the time against us. The Total Score Percentage is roughly the same, Against Random Players we have 60% of the points and yet win most of the time with that, Edgar against us has 60% of the points and yet wins most of the time with that.

New Primitives

New Primitives had 2 runs only, however they ran for much longer. The first one had 12 generations before terminating because of a system shut-down. The second run went for 18 generations before it was interrupted, so it never hit terminal fitness. In general we would expect that since we have only added two new primitives and nothing else, we would have done as well as the Baseline since it could just ignore the new primitives and get GP's as good as the Baseline. However we see that it was only able to hit terminal fitness once while the Baseline hit it four times with much less generations. For this I have no explanation. What is presented here are only those GP players produced which contain the new primitives. There were those GP players which seemed to do well, yet didn't contain the new primitives, that were also produced, though they are not presented here. (Frankly none of them seemed to do that well, especially those that finished best in the run.)
Run Number Generation Number Best Player In Run GP Function Against Random Players Against Randomized Edgar Players
GP Win Percentage GP Total Score Percentage* GP Win Percentage GP Total Score Percentage*
1 8 NO + / black white_corners white_near_edges  66% 58% 18% 40%
2 4 NO / * black_edges black_long white_edges  68% 61% 24% 37%
2 8 NO + black_long black_edges  72% 61% 10% 34%
* - AVG percentage of board occupied by black pieces at end of game.

Frankly the performance is lackluster. None of the GP players produced do anywhere near as well as the Baseline which as I've stated earlier is strange. And even though some of the players have interesting algorithms when we think about what strategy the GP player is following, none of them work that well. Especially strange is that we have a player (best in its generation like all these players), has a very high percentage against random players (relative to the rest of the players) using the new primitive black_long (Player 3 in the table), and yet has a percentage win of 10%, which is appalling. Even the GP Total Score Percentage which usually stays to around 40%, for the worse strategy in every experiment, here drops to 34%. And what is really weird is that I understand the strategy the GP suggests, I think I use that strategy when I play (which may explain why I lose), and yet it does so badly. (The player is black_edges + black_long, which suggests that when I'm trying to get edge pieces, it would be advisable to clump the edges together. A very valid strategy in my opinion.)

Variety Training

Using the Variety Training took an incredibly long amount of time, and so we were only able to get one run with nine generations. This run did terminate with a GP player with a fitness of 40. Subjectively while looking at the GP players produced, I was impressed. I am aware that what I'm saying here is both subjective and inexact, but in any case, the GP players produced by the first two methods had a lot of players that were just asking to be wiped by any intelligent player. Here almost all our players were combinations of good things to have when playing, black, black_edges, black_corners. All of them seemed to boast good strategies. Many of them were slight alterations of each other, seemingly it found something good, that was also good with some modifications, which could mean it was good over a larger set of test cases. And the GPs produced did the best overall at least from what I tested in my evaluation over 50 games. I present two here.
Run Number Generation Number Best Player In Run GP Function Against Random Players Against Randomized Edgar Players
GP Win Percentage GP Total Score Percentage* GP Win Percentage GP Total Score Percentage*
1 3 NO + + white_near_edges black black_corners  90% 66% 34% 44%
1 7 NO + + white_near_edges black black  74% 61% 16% 39%
* - AVG percentage of board occupied by black pieces at end of game.

Run 1 Generation 3 has the highest evaluation scores of any GP player produced and that is in every category, whether against Random Players, or against Edgar, whether we're dealing with Win Percentage or Total Score Percentage. And it was generated with only three generations. In addition, Run 1 Generation 7 has very high test scores, (at least compared to New Primitives.) While the time took longer to train, I believe the GP's selected were probably on the whole better players because of the experience of having trained both against expert and random players. The fact that we didn't hit Terminal Fitness was expected now that we were training against Edgar and were expecting to lose more often.

New Evaluation Method

The data from using the Total Score Percentage is already presented. Frankly, it seems that Score Percentage roughly follows a 60-40 split, 60% to the better strategy. So as an evaluation tool it doesn't work. Which doesn't mean that the experiment is a failure. What is important here is to realize that Total Scores are used as a fitness measure throughout the learning process, but the scores here seem to say that in general its more important to consider how many times one has won and not by how much. If we had the time we should be training using 50 games and not 5 games per generation for each individual, and use Win Percentage for our fitness measure and not the Total Score Percentage. No matter how good or bad you are, it seems that you win/lose only by roughly the same amount of points.


Most of what I believe the data has "told" me, I've already noted in the Results section. I would just like to sum up here what I've already said. Once again I can't explain how the Baseline did so much better than New Primitives. The only thing I can think of is that my new primitives do not accomplish anything and they confused the GP system with useless information. They take a long time to compute and may not really encapsulate anything which the GP can really use to evalute the board well. The GP is spending time trying to create players with these new primitves making the system go that much slower in creating a good Othello player. New Primitives did not do well, quite poorly in fact, though it did beat the random player, though against Edgar, one "good" GP player won only 10% of the time. Variety Training seemed to help a lot and we did beat the Baseline, and the GP players produced looked on the whole all strategically sound and promising. The New Evaluation Method didn't seem to give any new insights except for the fact that we should be looking for a new fitness measure, and that if we had the time we would play more games in training and use the win percentage. 

Click here to see my best GP (Variety Training, Run #1, Generation #3) in action against

Last modified Nov 18, 1997 by
Monty Kahan