William Bauder's Genetic Othello

Tables | Graphs | Play My Players


Machine Learning - CS4771
Homework #4 - Fall 1997

Introduction

This project is based upon a genetic programming package for Othello that was provided to us for experimentation. Along with the package, we were also given "Edgar"--a very talented Othello player (I have yet to beat Edgar myself). The primary goal of this project is to explore genetic programming and to attempt to improve upon the results of the existing package. I have chosen two points of focus to achieve these aims.

1. Complexity and Fitness

In the existing package, the fitness measure includes adding the complexity of the individual to the raw fitness measure (based on that individual's performance against a random player). My (brief) experience with complexity implied that this approach is a double-edged sword.

For example, I had noticed the benefit in an experiment that "learned" algebraic functions. Many of the successful individuals matched the output of the target function, but were much more complex than it. Closer examination revealed that these individuals contained branches that evaluated to zero (for example, a branch equal to x*x*x*(x-x): since x-x = 0, the entire branch equaled zero regardless of the input). When these individuals were reduced to take these zero branches into account, they equaled the target function.
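As a tiny illustration (not taken from the original experiment; the target function and names here are hypothetical), the following sketch checks that an evolved expression carrying a zero branch is functionally identical to the simpler target:

    // Hypothetical example: an evolved expression carrying a dead (zero) branch.
    public class ZeroBranchDemo {
        static double target(double x)  { return x * x; }
        static double evolved(double x) { return x * x + x * x * x * (x - x); } // (x - x) = 0

        public static void main(String[] args) {
            for (double x = -2.0; x <= 2.0; x += 0.5) {
                // The extra branch always contributes 0, so the two functions agree.
                System.out.println(x + ": " + (target(x) == evolved(x)));
            }
        }
    }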

On the other hand, my early experiments with the Othello package suggested something completely different. Many of my early runs quickly and consistently produced one "best" individual of complexity one. Apparently, the complexity term was skewing the evaluation so that the individual reported as the best was probably not the strongest player in its generation, and the population tended to drift toward less complex individuals.

My solution to this problem was to scale down complexity's contribution to overall fitness. I would attempt to find a scale that reduced complexity from a major component of fitness to a way of differentiating individuals of different lengths: one that favored smaller individuals (meaning reduced individuals, in which every branch can factor into the game strategy) over larger individuals that otherwise performed just as well.
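As a rough sketch of the intended fitness calculation (the divisor of 50 is the value I eventually settled on, described below; the class and method names here are hypothetical, not the package's):

    // Complexity contributes only a small tiebreaking term to the overall fitness.
    public class ScaledComplexityFitness {
        static final double COMPLEXITY_SCALE = 50.0;

        // rawFitness: lower is better (the game-performance measure, averaged over games)
        // complexity: the size of the individual's expression tree
        static double fitness(double rawFitness, int complexity) {
            return rawFitness + complexity / COMPLEXITY_SCALE;
        }
    }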


2. Evaluation by Example

My second aim was to improve the results of genetic players--in terms of both time (required to evolve) and game performance. The current Othello players were respectable - but not very good. Ideally, I would have liked to evolve a system that was capable of beating Edgar (Astro Teller's Othello Player), but a significant improvement over the existing players, evolved in a reasonable amount of time, would be quite sufficient.

I was provided with one previously created and tested player. I tested this against 500 random players to map out a few benchmarks. In this test, the player won 31.2% of its games, with varying performance. Some of the games were very decisive wins or losses, while others were fairly close. I also tested the current package to see what other players it would produce. One individual appeared often, and seemed to perform well. It was a single node, returning the value of the color that the player is playing, effectively attempting to maximize the number of pieces of that color at every move. This player, playing against 500 random players, won 46.4% of its games, again with widely varying results.

Although I thought that a lot of the current results were due to the complexity factor, I also wanted to vary the fitness measure of the individual to better represent what needed to be learned. I also wanted to incorporate some of Edgar's skill into my players. The existing fitness measure was simply based on the opponent's score at the end of the game (in this case, lower fitness measures mapped to "fitter" individuals, while higher fitness measures indicated poor performance). Instead of basing my results on the end of the game, I decided to try an evaluation that found the difference between Edgar's ranking of each of the player's moves and the ranking of the move that Edgar thought was best, averaged over the course of the game.


The Experiments

In the early, exploratory part of the experiments, I ran very small test runs, usually two or three runs of twenty generations each, using a population of fifty. When I finalized what I was attempting, I ran a single run on a population of 200 for 100 generations in two of my tests, and for 200 generations in the third. Each individual in every test was evaluated over five games, except in the last test, where each individual played ten games. Most of the tests (including all of the final tests) were run against random players -- Edgar and the genetic players are deterministic, and would play the same game repeatedly.

The Complexity Test

I began to attack the complexity problem by reducing the factor that complexity played in the overall fitness of an individual. Evaluating the fitness of an individual by adding the raw complexity to the evaluation of play performance made enormous differences in the final fitness measure; individuals that performed well in the games were often categorized as less fit than individuals that did not perform as well but were much less complex. I began adding the complexity divided by a constant to the fitness measure. I tried using 2, 5, 10, and 20 as this constant in tests of twenty generations, but the effect of the complexity was still quite noticeable. Finally, I settled on dividing the complexity by 50. Although more intensive tests would prove otherwise, it did not seem to affect the population greatly in tests over twenty generations.

I didn't want complexity to play a major factor in the fitness of an individual. I was hoping to reduce the contribution of complexity enough that it would not affect the comparative fitness of individuals that performed differently in play. I did not want a player that was not as good at Othello but much less complex to be rated as "fitter" than an individual that performed better in the game; I wanted complexity to act as a tiebreaker of sorts between two (possibly functionally equivalent) individuals that performed the same while playing, but had differing levels of complexity. I hoped to accomplish this by reducing the amount that complexity added to fitness to well below the typical differences in game-performance fitness.

The final factor of complexity/50 seemed to suit this well. Without computing the mean of fitness and complexity over the entire test, I would say that most of the complexity values ranged somewhere between 10 and 25, while most of the fitness measures ran between 100 and 170. Since these complexities were divided by 50, most of the fitness measures were increased by between .2 and .5 -- roughly on the order of 10^-3 of the fitness value itself (for example, 0.3 added to a fitness of 150 is an increase of about 0.2%).

With my new game performance measure in place, I ran two tests -- one using this reduced complexity as a factor in fitness, and one that ignored complexity altogether -- to 100 generations with a population size of 200. These tests were actually quite successful -- they produced a striking difference in the average complexity, the average fitness, and the fitness of the population over time. In the higher generations, the test with complexity converged on many copies of a small number of distinct individuals (three were very common, led by an individual with a complexity of three), while the average fitness of the population began to stagnate as the variety dipped lower. Meanwhile, the test without complexity grew a population of very large, complex individuals, but the variety actually increased to 100% (a population consisting entirely of unique individuals), while the average fitness of the population continued to improve, finishing substantially lower than the fitness of the population that used complexity as part of fitness.


The Evaluation Test

The current evaluation of game performance was the score of the individual's opponent. Some of these players performed well, but I wanted to better their performance and add some level of strategy to the genetic players. Since I am not an expert Othello player, I decided to use a more knowledgeable resource in Othello strategy -- Edgar. Edgar basically works by evaluating and assigning a value to all possible moves, and picking the one that ranks the best. However, Edgar does this by evaluating each move individually against the current board state, so I was able to pass the move that the genetic player chose to Edgar for evaluation. I would then have Edgar find his own move, and return the ranking of that move. The fitness of a particular genetic player's move was the difference between Edgar's evaluation of that move and the evaluation of the move that Edgar would have chosen. These evaluations were summed over the course of each game, and were divided by the number of moves at the end of the game (to avoid penalizing players that made more moves). My initial tests showed that these measures were often quite low, so I multiplied them by a factor of ten to scale them to a more practical value -- I didn't want to deal too much in decimal fitness measures, and the scaled measures were large enough to use with the same complexity scheme that I currently had in place.
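The sketch below outlines this evaluation, assuming a hypothetical Evaluator interface standing in for however Edgar exposes its move ranking (the names are mine, not the package's):

    // Hypothetical stand-in for Edgar's move-ranking ability.
    interface Evaluator {
        double rank(int[] board, int move);   // Edgar's value for a candidate move
        int bestMove(int[] board);            // the move Edgar himself would choose
    }

    // Accumulates the per-move difference between Edgar's chosen move and the
    // genetic player's move, then averages it over the game.
    class EdgarDifferenceFitness {
        private double totalDifference = 0.0;
        private int moves = 0;

        void recordMove(Evaluator edgar, int[] board, int playersMove) {
            double playerRank = edgar.rank(board, playersMove);
            double edgarRank  = edgar.rank(board, edgar.bestMove(board));
            totalDifference += edgarRank - playerRank;   // 0 when the player matches Edgar
            moves++;
        }

        // Lower is better; multiplied by ten so the values are not tiny decimals.
        double gameFitness() {
            return moves == 0 ? 0.0 : 10.0 * totalDifference / moves;
        }
    }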

I tested this concurrently with the complexity changes in the tests that I have outlined in the previous section. Once I had evolved the best players, I chose several with the best fitness for performance testing against random players. I chose the six top players from the test that did not use complexity as part of the fitness measure, and eight players from the test that did. (I had originally intended to test five players from each, but settled on six and eight because the last few players were very close in terms of fitness.) I tested each set of players against a random player for fifty games, and if there was one that was clearly performing worse than the others (in terms of winning percentage), I eliminated it from the set. I eventually tested the best players for 500 games. The results were rather discouraging.

The players that were evaluated without complexity did the best, adding some weight to my argument about complexity. However, they did not perform that well overall. The two best individuals from the test that did not use complexity lost more than they won against random players (36.6% and 36.8%). While this was better than the existing complex player (31.2%), this was still much worse than the player that just attempted to maximize its pieces at every move (46.4%). While this did represent an improvement over the current method, it was still very poor. A very simple heuristic was outperforming it significantly, and simply picking a random move was winning almost two out of three games against it.

The players that were evaluated with complexity as a factor in fitness were much worse. The best player from that test won 32.6% of its games, but was taken from the second of 100 generations--long before the population really had time to evolve. The next best player managed to win 25.6% of its games against random players, the third best player won 23.6% of its games, while the last five of the best players all had winning percentages in the teens. (However, the test using complexity was used more for determining the role of complexity in evolving genetic players than to evaluate the fitness measure.)

I examined the performance of the best players. What I discovered by looking over their results and their games was encouraging. Although they lost more games than they won, they usually had fairly close losses, and very decisive victories. While some of this can be attributed to the random players, I felt that this indicated that they had learned something. Although they had not learned all of Edgar's skill, they had acquired some of Edgar's ability to take advantage of an opponent's mistakes. They simply needed to learn to win. This was understandable, since the current evaluation did not take win or loss into account, but merely how well Edgar thought that they had played.

I devised a modified fitness measure, based on what I had observed. I decided not to use complexity as part of the fitness, since the previous test provided me with strong evidence that it did not improve performance. I also increased the number of games that each individual played from five to ten in an attempt to decrease the amount of noise from the random players. (Toward the end of the test that used complexity, I found that two or three players made up the entire population, yet accounted for a wide range of fitness--most likely because one got a series of bad random players and another got a series of good random players. I hoped that by increasing the number of games, each individual would face a better variety of random players, and the chances of an individual being evaluated as better or worse than it actually was would decrease.)

Finally, I modified the fitness measure itself. The new fitness would be the previous fitness (the average difference between Edgar's evaluation of the player's move and Edgar's best move, multiplied by ten) multiplied by the ratio of the opponent's score to the player's score at the end of the game. (This was quickly changed to the opponent's score plus one divided by the player's score plus one, in order to avoid zeros in the numerator or denominator.) The theory behind this approach was to continue taking Edgar's evaluation into account, while adding an incentive to win. Edgar's evaluation would be scaled up to a significantly larger number for a loss against the random player, and would be scaled down significantly for a victory. Furthermore, the player would be penalized very severely for losing badly, and rewarded greatly for a decisive victory.
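A minimal sketch of this modification, again with hypothetical names; previousFitness is the scaled Edgar-difference average from the earlier measure:

    // Scales the Edgar-based fitness by the final score ratio: losses inflate the
    // lower-is-better fitness, while wins shrink it.
    public class ScoreRatioFitness {
        static double modifiedFitness(double previousFitness, int playerScore, int opponentScore) {
            // Adding one to each score avoids a zero numerator or denominator in shutouts.
            double ratio = (opponentScore + 1.0) / (playerScore + 1.0);
            return previousFitness * ratio;
        }
    }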

This method seems to have worked the best in my tests against the random players. Out of the ten players that I tested (there wasn't much difference in the fitness measures of the best players), four of the players scored above 40% (46%, 44%, 43.2%, and 41.4%), the best of which came very close to the strategy that maximized the number of pieces at every move (46.4%, or a difference of 0.4%). This was also significantly better than the best individuals produced with the other fitness measures. While the best player was not able to win more than it lost against the random player, and only nearly equalled the winning percentage of the simple heuristic player, it was a significant improvement over the existing strategies.


Conclusions

I think that the experiment was more successful than unsuccessful, although I did not accomplish all that I wanted to the degree that I wanted. The complexity test was a definite success -- the results supported my initial theory that using complexity as a factor in the fitness measure was harmful, and to an even greater extent than I had expected. I recommend taking a look at the graphs illustrating the results of the complexity test. In particular, the graphs of average fitness, average complexity, and variety illustrate the point best. (You may also view the tables used to generate the graphs, although they are a little harder to read.) However, the most striking thing that I took from the complexity test had nothing to do with complexity. Instead, it was just how sensitive a population of genetic programs is to any variable. The overall effect of complexity was what I would previously have classified as negligible -- yet over time, it became a strong, perhaps even defining, feature of the population that used complexity in its evaluation.

The test of the different evaluation functions was not quite as successful. My target was to create a player that could compete on Edgar's level, but I was only able to produce players that could compete with a very simple heuristic, and could almost hold their own against a player that randomly chose legal moves. However, I do feel that it was successful for several reasons. The rising winning percentage over my test cases, compared to the "best" player that I had been given to start with, indicates that the fitness measure had been improved and that something had actually been learned. Furthermore, the final test (using the fitness measure that multiplied the average evaluation by the ratio of the scores) did beat Edgar decisively in one respect. When I tested Edgar in 500 games against the random player, it had a winning percentage of 96%, but it never defeated a random player before the board was filled. Meanwhile, all four of the best players in the last test had multiple games that ended before the board was filled (winning before filling the game board between 3.6% and 5.6% of the time, with two players each winning one game by a score of 16-0). While none of the players were able to consistently defeat the random player, they often had very strong wins, indicating that progress had been made.


William Bauder
wjb14@columbia.edu