William Bauder's Genetic Othello

Tables | Graphs | Play My Players


Machine Learning - CS4771
Homework #4 - Fall 1997

Introduction

This project is based upon a genetic programming package for Othello that was provided to us for experimentation. Along with the package, we were also given "Edgar"--a very talented Othello player (I have yet to beat Edgar myself). The primary goal of this project is to explore genetic programming and to attempt to improve upon the results of the existing package. I have chosen two points of focus to achieve these aims.

1. Complexity and Fitness

In the existing package, the fitness measure includes adding the complexity of the individual to the raw fitness measure (based on that individual's performance against a random player). My (brief) experience with complexity implied that this approach is a double-edged sword.

For example, I had noticed the benefit in an experiment that "learned" algebraic functions. Many of the successful individuals matched the output of the target function, but were much more complex than it. Closer examination revealed that these individuals contained branches that evaluated to zero (for example, a branch equal to x*x*x*(x-x): since x-x = 0, the entire branch equaled zero regardless of the input). When these individuals were reduced to take these zero branches into account, they equaled the target function.
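As a tiny illustration (not taken from the original experiment; the target function and names here are hypothetical), the following sketch checks that an evolved expression carrying a zero branch is functionally identical to the simpler target:

    // Hypothetical example: an evolved expression carrying a dead (zero) branch.
    public class ZeroBranchDemo {
        static double target(double x)  { return x * x; }
        static double evolved(double x) { return x * x + x * x * x * (x - x); } // (x - x) = 0

        public static void main(String[] args) {
            for (double x = -2.0; x <= 2.0; x += 0.5) {
                // The extra branch always contributes 0, so the two functions agree.
                System.out.println(x + ": " + (target(x) == evolved(x)));
            }
        }
    }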

On the other hand, my early experiments with the Othello package suggested something completely different. Many of my early runs quickly and consistently produced one "best" individual of complexity one. Apparently, the complexity term was skewing the evaluation so that the individual reported as the best was probably not the strongest player in its generation, and the population tended to drift toward less complex individuals.

My solution to this problem was to scale down complexity's contribution to overall fitness. I would attempt to find a scale that reduced complexity from a major component of fitness to a way of differentiating individuals of different lengths: one that favored smaller individuals (meaning reduced individuals, in which every branch can factor into the game strategy) over larger individuals that otherwise performed just as well.
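As a rough sketch of the intended fitness calculation (the divisor of 50 is the value I eventually settled on, described below; the class and method names here are hypothetical, not the package's):

    // Complexity contributes only a small tiebreaking term to the overall fitness.
    public class ScaledComplexityFitness {
        static final double COMPLEXITY_SCALE = 50.0;

        // rawFitness: lower is better (the game-performance measure, averaged over games)
        // complexity: the size of the individual's expression tree
        static double fitness(double rawFitness, int complexity) {
            return rawFitness + complexity / COMPLEXITY_SCALE;
        }
    }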


2. Evaluation by Example

My second aim was to improve the results of genetic players--in terms of both time (required to evolve) and game performance. The current Othello players were respectable - but not very good. Ideally, I would have liked to evolve a system that was capable of beating Edgar (Astro Teller's Othello Player), but a significant improvement over the existing players, evolved in a reasonable amount of time, would be quite sufficient.

I was provided with one previously created and tested player. I tested this against 500 random players to map out a few benchmarks. In this test, the player won 31.2% of its games, with varying performance. Some of the games were very decisive wins or losses, while others were fairly close. I also tested the current package to see what other players it would produce. One individual appeared often, and seemed to perform well. It was a single node, returning the value of the color that the player is playing, effectively attempting to maximize the number of pieces of that color at every move. This player, playing against 500 random players, won 46.4% of its games, again with widely varying results.

Although I thought that a lot of the current results were due to the complexity factor, I also wanted to vary the fitness measure of the individual to better represent what needed to be learned. I also wanted to incorporate some of Edgar's skill into my players. The existing fitness measure was simply based on the opponent's score at the end of the game (in this case, lower fitness measures mapped to "fitter" individuals, while higher fitness measures indicated poor performance). Instead of basing my results on the end of the game, I decided to try an evaluation that found the difference between Edgar's ranking of each of the player's moves and the ranking of the move that Edgar thought was best, averaged over the course of the game.


The Experiments

In the early, exploratory part of the experiments, I ran very small test runs, usually two or three runs of twenty generations each, using a population of fifty. When I finalized what I was attempting, I ran a single run on a population of 200 for 100 generations in two of my tests, and for 200 generations in the third. Each individual in every test was evaluated over five games, except in the last test, where each individual played ten games. Most of the tests (including all of the final tests) were run against random players -- Edgar and the genetic players are deterministic, and would play the same game repeatedly.

The Complexity Test

I began to attack the complexity problem by reducing the factor that complexity played in the overall fitness of an individual. Evaluating the fitness of an individual by adding the raw complexity to the evaluation of play performance made enormous differences in the final fitness measure; individuals that performed well in the games were often categorized as less fit than individuals that did not perform as well but were much less complex. I began adding the complexity divided by a constant to the fitness measure. I tried using 2, 5, 10, and 20 as this constant in tests of twenty generations, but the effect of the complexity was still quite noticeable. Finally, I settled on dividing the complexity by 50. Although more intensive tests would prove otherwise, it did not seem to affect the population greatly in tests over twenty generations.

I didn't want complexity to play a major factor in the fitness of an individual. I was hoping to reduce the contribution of complexity enough that it would not affect the comparative fitness of individuals that performed differently in play. I did not want a player that was not as good at Othello but much less complex to be rated as "fitter" than an individual that performed better in the game; I wanted complexity to act as a tiebreaker of sorts between two (possibly functionally equivalent) individuals that performed the same while playing, but had differing levels of complexity. I hoped to accomplish this by reducing the amount that complexity added to fitness to well below the typical differences in game-performance fitness.

The final factor of complexity/50 seemed to suit this well. Without computing the mean of fitness and complexity over the entire test, I would say that most of the complexity values ranged somewhere between 10 and 25, while most of the fitness measures ran between 100 and 170. Since these complexities were divided by 50, most of the fitness measures were increased by between .2 and .5 -- roughly on the order of 10^-3 of the fitness value itself (for example, 0.3 added to a fitness of 150 is an increase of about 0.2%).

With my new game performance measure in place, I ran two tests -- one using this reduced complexity as a factor in fitness, and one that ignored complexity altogether -- to 100 generations with a population size of 200. These tests were actually quite successful -- they produced a striking difference in the average complexity, the average fitness, and the fitness of the population over time. In the higher generations, the test with complexity converged on many copies of a small number of distinct individuals (three were very common, led by an individual with a complexity of three), while the average fitness of the population began to stagnate as the variety dipped lower. Meanwhile, the test without complexity grew a population of very large, complex individuals, but the variety actually increased to 100% (a population consisting entirely of unique individuals), while the average fitness of the population continued to improve, finishing substantially lower than the fitness of the population that used complexity as part of fitness.


The Evaluation Test

The current evaluation of game performance was the score of the individual's opponent. Some of these players performed well, but I wanted to better their performance and add some level of strategy to the genetic players. Since I am not an expert Othello player, I decided to use a more knowledgeable resource in Othello strategy -- Edgar. Edgar basically works by evaluating and assigning a value to all possible moves, and picking the one that ranks the best. However, Edgar does this by evaluating each move individually against the current board state, so I was able to pass the move that the genetic player chose to Edgar for evaluation. I would then have Edgar find his own move, and return the ranking of that move. The fitness of a particular genetic player's move was the difference between Edgar's evaluation of that move and the evaluation of the move that Edgar would have chosen. These evaluations were summed over the course of each game, and were divided by the number of moves at the end of the game (to avoid penalizing players that made more moves). My initial tests showed that these measures were often quite low, so I multiplied them by a factor of ten to scale them to a more practical value -- I didn't want to deal too much in decimal fitness measures, and the scaled measures were large enough to use with the same complexity scheme that I currently had in place.
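The sketch below outlines this evaluation, assuming a hypothetical Evaluator interface standing in for however Edgar exposes its move ranking (the names are mine, not the package's):

    // Hypothetical stand-in for Edgar's move-ranking ability.
    interface Evaluator {
        double rank(int[] board, int move);   // Edgar's value for a candidate move
        int bestMove(int[] board);            // the move Edgar himself would choose
    }

    // Accumulates the per-move difference between Edgar's chosen move and the
    // genetic player's move, then averages it over the game.
    class EdgarDifferenceFitness {
        private double totalDifference = 0.0;
        private int moves = 0;

        void recordMove(Evaluator edgar, int[] board, int playersMove) {
            double playerRank = edgar.rank(board, playersMove);
            double edgarRank  = edgar.rank(board, edgar.bestMove(board));
            totalDifference += edgarRank - playerRank;   // 0 when the player matches Edgar
            moves++;
        }

        // Lower is better; multiplied by ten so the values are not tiny decimals.
        double gameFitness() {
            return moves == 0 ? 0.0 : 10.0 * totalDifference / moves;
        }
    }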

I tested this concurrently with the complexity changes in the tests that I have outlined in the previous section. Once I had evolved the best players, I chose several with the best fitness for performance testing against random players. I chose the six top players from the test that did not use complexity as part of the fitness measure, and eight players from the test that did. (I had originally intended to test five players from each, but settled on six and eight because the last few players were very close in terms of fitness.) I tested each set of players against a random player for fifty games, and if there was one that was clearly performing worse than the others (in terms of winning percentage), I eliminated it from the set. I eventually tested the best players for 500 games. The results were rather discouraging.

The players that were evaluated without complexity did the best, adding some weight to my argument about complexity. However, they did not perform that well overall. The two best individuals from the test that did not use complexity lost more than they won against random players (36.6% and 36.8%). While this was better than the existing complex player (31.2%), this was still much worse than the player that just attempted to maximize its pieces at every move (46.4%). While this did represent an improvement over the current method, it was still very poor. A very simple heuristic was outperforming it significantly, and simply picking a random move was winning almost two out of three games against it.

The players that were evaluated with complexity as a factor in fitness were much worse. The best player from that test won 32.6% of its games, but was taken from the second of 100 generations--long before the population really had time to evolve. The next best player managed to win 25.6% of its games against random players, the third best player won 23.6% of its games, while the last five of the best players all had winning percentages in the teens. (However, the test using complexity was used more for determining the role of complexity in evolving genetic players than to evaluate the fitness measure.)

I examined the performance of the best players. What I discovered by looking over their results and their games was encouraging. Although they lost more games than they won, they usually had fairly close losses, and very decisive victories. While some of this can be attributed to the random players, I felt that this indicated that they had learned something. Although they had not learned all of Edgar's skill, they had acquired some of Edgar's ability to take advantage of an opponent's mistakes. They simply needed to learn to win. This was understandable, since the current evaluation did not take win or loss into account, but merely how well Edgar thought that they had played.

I devised a modified fitness measure, based on what I had observed. I decided not to use complexity as part of the fitness, since the previous test provided me with strong evidence that it did not improve performance. I also increased the number of games that each individual played from five to ten in an attempt to decrease the amount of noise from the random players. (Toward the end of the test that used complexity, I found that two or three players made up the entire population, yet accounted for a wide range of fitness--most likely because one got a series of bad random players and another got a series of good random players. I hoped that by increasing the number of games, each individual would face a better variety of random players, and the chances of an individual being evaluated as better or worse than it actually was would decrease.)

Finally, I modified the fitness measure itself. The new fitness would be the previous fitness (the average difference between Edgar's evaluation of the player's move and Edgar's best move, multiplied by ten) multiplied by the ratio of the opponent's score to the player's score at the end of the game. (This was quickly changed to the opponent's score plus one divided by the player's score plus one, in order to avoid zeros in the numerator or denominator.) The theory behind this approach was to continue taking Edgar's evaluation into account, while adding an incentive to win. Edgar's evaluation would be scaled up to a significantly larger number for a loss against the random player, and would be scaled down significantly for a victory. Furthermore, the player would be penalized very severely for losing badly, and rewarded greatly for a decisive victory.
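A minimal sketch of this modification, again with hypothetical names; previousFitness is the scaled Edgar-difference average from the earlier measure:

    // Scales the Edgar-based fitness by the final score ratio: losses inflate the
    // lower-is-better fitness, while wins shrink it.
    public class ScoreRatioFitness {
        static double modifiedFitness(double previousFitness, int playerScore, int opponentScore) {
            // Adding one to each score avoids a zero numerator or denominator in shutouts.
            double ratio = (opponentScore + 1.0) / (playerScore + 1.0);
            return previousFitness * ratio;
        }
    }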

This method seems to have worked the best in my tests against the random players. Out of the ten players that I tested (there wasn't much difference in the fitness measures of the best players), four of the players scored above 40% (46%, 44%, 43.2%, and 41.4%), the best of which came very close to the strategy that maximized the number of pieces at every move (46.4%, or a difference of 0.4%). This was also significantly better than the best individuals produced with the other fitness measures. While the best player was not able to win more than it lost against the random player, and only nearly equalled the winning percentage of the simple heuristic player, it was a significant improvement over the existing strategies.


Conclusions

I think that the experiment was more successful than unsuccessful, although I did not accomplish all that I wanted to the degree that I wanted. The complexity test was a definite success -- the results supported my initial theory that using complexity as a factor in the fitness measure was harmful, and to an even greater extent than I had expected. I recommend taking a look at the graphs illustrating the results of the complexity test. In particular, the graphs of average fitness, average complexity, and variety illustrate the point best. (You may also view the tables used to generate the graphs, although they are a little harder to read.) However, the most striking thing that I took from the complexity test had nothing to do with complexity. Instead, it was just how sensitive a population of genetic programs is to any variable. The overall effect of complexity was what I would previously have classified as negligible -- yet over time, it became a strong, perhaps even defining, feature of the population that used complexity in its evaluation.

The test of the different evaluation functions was not quite as successful. My target was to create a player that could compete on Edgar's level, but I was only able to produce players that could compete with a very simple heuristic, and could almost hold their own against a player that randomly chose legal moves. However, I do feel that it was successful for several reasons. The rising winning percentage over my test cases, compared to the "best" player that I had been given to start with, indicates that the fitness measure had been improved and that something had actually been learned. Furthermore, the final test (using the fitness measure that multiplied the average evaluation by the ratio of the scores) did beat Edgar decisively in one respect. When I tested Edgar in 500 games against the random player, it had a winning percentage of 96%, but it never defeated a random player before the board was filled. Meanwhile, all four of the best players in the last test had multiple games that ended before the board was filled (winning before filling the game board between 3.6% and 5.6% of the time, with two players each winning one game by a score of 16-0). While none of the players were able to consistently defeat the random player, they often had very strong wins, indicating that progress had been made.


William Bauder
wjb14@columbia.edu