Othello Project:
Variations on Fitness

by: Dave Pisapia


Results and Analysis


background information:

  • Information on Othello the game
  • Information on the given Othello GP implementation

    what I did:

    The purpose of this project is to address a particular issue which is important to all genetic programming applications:
    How does the chosen fitness measure affect a search through the hypothesis space?

    A secondary purpose, motivated primarily by the ease at which it could be done, was to learn about the effects of varying the number of training generations, and to see how this factor might relate to over-fitting the training situation.

    why I did it:

    The fundamental driving force of evolutionary computing, that which allows one population of hypotheses to evolve into a different, "better" population is the measurement of the quality of each individual hypothesis within a given generation of the population. That is, drawing directly from Darwin, those hypotheses which are best "fit" to survive whatever test conditions to which they are subjected will have a greater chance of influencing a subsequent generation of the population, and, therefore, the population itself will have a greater chance of producing an individual which is adequately fit overall. The key idea is that without a measurement of the hypothesis, and likewise without a measurer, a population is left to grow on its own accord without regard to its environment, or in the case of machine learning, without regard to the task to be learned. The motivation therefore is to select a fitness function which will evolve* the population in such a way as to optimize the evolved hypothesis, all other things being equal.

    It occured to me, as I looked over the evaluation function within the Othello implementation, that the chosen function was completely arbitrary. Namely, the given evaluation function may not produce the best representation of the fitness of a particular hypothesis in terms of the goal of the task, which is simply to win. Thus, I decided to devise a few new ways to evaluate each hypothesis and see how this change alone would affect the generation of a gentically programmed player. The details of my implementation will be discussed in the "Approach" section.

    As will be shown, the evaluation function itself does have a significant effect on both the complexity of the resulting hypothesis, and on the performance of the hypothesis. These results will be discussed in the "Results and Analysis" section.

    *the word 'evolve' is here used in an active sense


    the current fitness function:

    The current evaluation function consists of the following two features which I sought to alter:

  • the "fitness" of a hypothesis over a single game is simply set to the opponent's score at the end of the game (consequently, a lower fitness is a better fitness, and this convention is kept throughout the experiment).
  • the overall fitness of the hypothesis is obtained by adding up the fitness per game over five games

    my specific motivation and chosen alterations:

    The way I see it, neither of these two features necessarily measures in an adequate way which hypothesis will win more often. Simply counting the number of pieces the opponent ends up with does not tell you how easy it was for the opponent to lose, or how robust the hypothesis is. In other words, it is possible that a certain hypothesis will win more often but allow the opponent to score more pieces: since the object of winning is based solely on who has the majority of pieces (not how much of the majority), a better hypothesis is not necessarily one which allows fewer opponent pieces at the end of the game, a better hypothesis is one that wins more often. As for the second feature, the chosen number of iterations, 5, is obviously completely arbitrary, and there is no reason to doubt that another number will perform better.

    To address these two issues, I devised 4 variations on the first feature, and then varied the number of iterations variations to create other variations. Thus I performed experiments over 8 variations on the original fitness function. In addition, for each variation, I captured a resultant hypothesis for numerous different generation lenghts. Thus, the general effect of varying the number of training generations was also captured by these experiments.

    For a detailed description of each variation in fitness as well as a disply of the modified code, please refer to the following table:
  • Table of Variation Descriptions

    Technical Notes
  • All tests were run using a population of 200. This seemed to be a nice compromise between speed of testing and interesting results.
  • Complete details for each generation of test run are available upon request.
  • Variation #3 got lost in the storm, renumbering to come.

    Results and Analysis

    Tables and Data:

  • All derived GP players
  • Test results of these GP players against a random player

    When analyzing the above tables, it is important to keep in mind that the default performance of the given implementation was to win approximately 41 games in 50 when played against a random player.
    The following observations were made concerning the data:

  • Most importantly, all variations did as well or better than the default fitness function. In particular, Variation 6 exhibited 45, 46, and 47 wins for several test runs of varying generation length.
  • Whereas the default fitness measure consistantly produced only one-node expressions (namely: "white") as the resultant hypothesis irregardless of the test run number of generations, the variations produced more interesting, more complex, and depper hypothesis representations. These are by no means inherently better, but surely, a hypothesis would need to be more complex thatn "white" to perform well against a non-random opponent.
  • Of particular interest regarding the above point is Variation 4. Although this variation was perhaps closest conceptually to what the default function attempted to capture, the hypotheses created were very elaborate, almost exceedingly so. These elaborate trees still managed to perform better than the defualt measure in 3 of 5 test runs.
  • Again, regarding complexity, it should be noted, from the other side of the coin, that most hypothesis generated were quite simple (though more complex than "white"). Namely, the large majority were only two levels deep (disregarding variation 4) and the few that were more than two levels deep seemed to be minor altercations of those that were twop levels deep. (e.g. "+ * white white * white white" versus "* white white" where clearly, a factor of 2 is relatively insignificant when squaring a number which can be as large as 64).
  • Performance did not generally increase with an increased number of generations allowed for testing. This indicates one of two things: 1) over-fitting occurs, though this is unlikely to happen with a random trainer. 2) the entire process is too random to get good results. Perhaps the population base is too small, or the number of generations run is not long enough for the populations to stabalize.
  • Something which is not shown in the given data, but which I personally observed during the testing runs, was that a large number of games seemed to ouput the score: 27 to 0. That is, regardless of variation, generation or anything else...that score was by far the most frequent. ???
  • The high scores for variations 4 and 6 were quite high...in the sixties for several different generation length runs.
  • Increasing the number of games per evaluation (from 5 to 10 and 15) improved dramatically the complexity of the resulting hypothesis. Consider Variation 5 which would only output the default value "white". When this same variation was allowed to run 10 and 15 games per fitness evaluation (variations 6 and 7), the resultant hypotheses became much more interesting.


  • To put a capital "P" on perspective: Perspective.
  • It is obvious from these experiments that the fitness function is an important consideration to make when determining a proper implementation for a genetic algorithm. It raises the issue of bias, and how much bias we introduce purely by defining the fitness function the way we do...after all, it is the fitness funtion which drives the entire process.
  • Regarding this particular experiment, in this particularly narrow set of conditions, it was shown that altering the fitness function to include more information on the entire game rather than just the end of the game improved performance. It was also shown that conforming the finction closer to the pure goal of the task (i.e. Variation 6, 7) greatly improved performance (perhaps by reducing bias).
  • In general, evaluating fitness for a given hypothesis in a population over a greater number of games improves the complexity of the algorithm's resultant hypothesis.
  • The experiments also showed that an increased number of generation runs does not necessarily improve performance.
  • Although the results revealed some interesting observations, one wonders how much better a system trained with say, Edgar would do versus this one trained with a random player. Still, the random player provided a nice control with which to test the fitness function.

    Dave Pisapia -- April 4, 2000-- djp26@coloumbia.edu

    Machine Learning, cs4771
    Professor Siegel
    Columbia University