Catastrophes in Evolution

Eugene Mesgar
Sara Schumacher

Throughout evolution, catastrophes have changed the course of development by suddenly changing the way in which natural selection takes place. Phenotypes that survived best in one environment can completely falter in another.

For this project, we tried to simulate these catastrophes in a genetic programming algorithm that plays Othello by drastically changing the way fitness is measured, favoring hypotheses that may not have been favored before.

Adaptability of Genetic Programming Hypotheses

 

Motivation

The motivation behind this project is the search for the meta-trait of adaptability. In ordinary populations, we see the population slowly converging towards a relative notion of "fitness". For example, in a population where all the food is six feet above the ground, the individuals who are six feet tall or have the ability to jump are considered the "most fit". In modern America, the notion of "fitness" is loosely based on the ability to generate capital. In a situation like America, traits such as strength and physical robustness aren't critical components of the fitness metric; rather, traits such as intelligence and economic understanding emerge as the key fitness criteria.

Populations slowly converge towards these relative notions of fitness until a major change in the environment, like a physical catastrophe, completely changes what makes an individual fit. For example, imagine if a cataclysmic event destroyed all technology and set human civilization back to a medieval living style. In this situation, the investment bankers, computer scientists and lawyers would no longer be "fit", and the homeless, because of their survival skills, would probably emerge as "the fittest".

The point of this argument is not to make a social statement, but rather to demonstrate that after many generations have been subjected to catastrophic events and the fitness metric has changed repeatedly, an emergent quality appears in the remaining members of the population: adaptability and robustness.

Each of the genotypes in the remaining population was able to adapt to each new "fitness metric" and emerge as a species capable of adapting to new situations. The inductive generalization is that if an individual is fit under a series of different fitness measures, the individual should also be fit in other situations and hopefully possess 'adaptability'.

In Othello, we see the convergence towards a "local relative fitness metric" in a hypothesis's ability to adapt to an individual player's strategy but fail to generalize over many players. Lacking many players to train with, we decided to vary the fitness metric from generation to generation (in a cyclical manner) in the hope that our player would display the emergent property of "adaptability" in general and, hopefully, be able to beat a GP player trained with the standard genetic programming setup.

 

Approach

Fitness Metrics

We developed six different versions of what it means to be a "good Othello player." Each metric was created to value different features of an Othello player's playing style; a sketch of how these metrics might be computed follows the list below.
Note: in the actual implementation, the fitness metrics are inverted so that 0 represents the best possible fitness.

  1. Standard Fitness Metric

    The first metric used is the standard Othello fitness metric, i.e., points. This metric is used as a baseline which should correlate with general fitness.

  2. Speedy Games Favored

    This metric favors "fast" games. The metric is based on dividing the score by the number of moves in a game. A player who wins a game very quickly is valued much higher than a player who plays a very slow, defensive game. The metric hopefully encourages an offensive strategy.

  3. Winning Streaks Favored

    This metric is based on "squaring the number of games won." Since individual players are evaluated over five games, this "winning streak" fitness metric values a player who wins 3/5 games disproportionately more than a player who wins 2/5 games. The hope behind this fitness metric is that it will make it easier for highly fit individuals to differentiate themselves in the population.

  4. Slow Games Favored

    In this situation, the player's score is multiplied by the number of moves in the game. As the opposite of metric #2, this metric values slow, defensive games which take longer, and hopefully encourages "defensive playing."

  5. Openness of the Game Space

    This metric is based on the observation that there might be a correlation between the "number of moves you can make at any point in the game" and "how fit you are as an Othello player". This method sums the possible number of moves for each player over the game (summed at each round) and then derives a fitness metric by dividing the two totals. For example, if over the whole game white had a total of 30 possible moves and black had a total of ten possible moves, white's fitness would be based on the ratio 30/10 = 3.

  6. Straight Linear

    This straightforward metric is a linear sum of the number of games won (or lost, depending on how you look at it).
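
To make the six metrics concrete, here is a minimal sketch of how each might be computed from per-game statistics. The GameResult record, its field names, and the helper function are illustrative assumptions written for this write-up, not the project's actual code; only the formulas themselves come from the descriptions above.

    from dataclasses import dataclass

    @dataclass
    class GameResult:
        # Illustrative per-game statistics; names are assumptions, not the project's code.
        score: int           # final piece count for the hypothesis player
        won: bool            # whether the hypothesis won this game
        moves: int           # total number of moves in the game
        my_mobility: int     # possible moves for the hypothesis, summed over every round
        opp_mobility: int    # possible moves for the opponent, summed over every round

    def raw_fitness(metric: int, games: list) -> float:
        """Return a raw 'bigger is better' value for one of the six metrics (1-6).
        The actual implementation inverts these values so that 0 is the best fitness."""
        wins = sum(g.won for g in games)                  # games is the five-game evaluation set
        if metric == 1:   # 1. standard metric: total points
            return float(sum(g.score for g in games))
        if metric == 2:   # 2. speedy games favored: score divided by game length
            return sum(g.score / g.moves for g in games)
        if metric == 3:   # 3. winning streaks favored: games won, squared
            return float(wins ** 2)
        if metric == 4:   # 4. slow games favored: score multiplied by game length
            return float(sum(g.score * g.moves for g in games))
        if metric == 5:   # 5. openness of the game space: ratio of total mobility
            return sum(g.my_mobility for g in games) / max(1, sum(g.opp_mobility for g in games))
        if metric == 6:   # 6. straight linear: games won
            return float(wins)
        raise ValueError(f"unknown metric {metric}")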

Distribution and Normalization

The main problem with having so many different fitness metrics is that each time you change the metric (i.e., once per generation), you would have to reevaluate the entire population. Given the extreme slowness of the regular training process, this constant reevaluation is not practical. The alternative to constant reevaluation is to normalize all fitness metrics into a single range. Each metric produces a different range of values: for example, metric #1 produces fitnesses from 0 to roughly 300, metric #6 produces fitnesses from 0 to 5, and so on. In order to normalize each measure, we had to gather a priori information about each fitness measure. To do this, we ran the GA for each fitness measure and determined the possible range of values for that measure. Then, during the real run, we used this previously collected data to normalize each fitness into a value between 0 and 1.

The assumption behind this process is that the distribution of values between the minimum and maximum for each fitness measure is the same. This is not necessarily a valid assumption, but it is definitely a useful one for implementing this experiment.
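
Below is a minimal sketch of the normalization step, assuming the per-metric ranges were gathered in the preliminary runs described above. Only the ranges for metrics #1 and #6 are quoted from the text; the others are placeholders, not our measured values.

    # Approximate (min, max) raw-fitness ranges gathered from the preliminary runs.
    # Only metrics 1 and 6 reflect ranges quoted above; the rest are placeholders.
    OBSERVED_RANGE = {
        1: (0.0, 300.0),    # standard points metric
        2: (0.0, 10.0),
        3: (0.0, 25.0),
        4: (0.0, 20000.0),
        5: (0.0, 5.0),
        6: (0.0, 5.0),      # straight linear (games won out of five)
    }

    def normalized_fitness(metric: int, raw: float) -> float:
        """Map a raw metric value into [0, 1] using the a priori range for that metric.
        As noted above, this assumes values are distributed similarly across metrics."""
        lo, hi = OBSERVED_RANGE[metric]
        clipped = min(max(raw, lo), hi)   # guard against values outside the observed range
        return (clipped - lo) / (hi - lo)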

 

Testing

The goal of our testing was to discover whether a more diverse hypothesis, obtained by training against multiple fitness attributes, was more successful against the two players that we have, Edgar and the Random Player. In order to show this, we trained four different Othello players: [the standard genetic program evaluates fitness based on the number of pieces the opponent possesses, while the modified code varies between 4 different fitness evaluation routines.]

  • Standard genetic programming against the Random Player
  • Standard genetic programming against Edgar
  • Modified code against the Random Player
  • Modified code against Edgar

The goal of our test was not to evaluate the different fitness evaluation mechanisms individually, but rather to evaluate their combined effect in a changing environment. Therefore, we deliberately did not test the individual fitness evaluation routines.

For control values against which we could compare, we trained the initial code to play against the random player and against Edgar. We trained with a population of 100 for 100 generations. We kept the termination fitness value (the point at which the run would print out the best tree) as low as possible so we could get the most generational results. We did this because our modifications depend on the existence and development of many generations, and we thought the comparison with the standard code would be fairer if we used a comparable number of generations.
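
For reference, the control runs can be summarized with the parameters below. The key names and the exact termination value are our own illustrative choices, not the GP framework's settings; only the population size, generation count, and five-game evaluation come from the text above.

    # Illustrative summary of the control-run configuration (names are ours).
    CONTROL_RUN = {
        "population_size": 100,
        "max_generations": 100,
        "games_per_evaluation": 5,    # each individual is scored over five games
        "termination_fitness": 0.0,   # assumed value: kept as low as possible so runs rarely stop early
    }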

Training took a lot of time. Due to the extra time spent running the genetic programs prior to the actual testing (see the information on normalization) and the extra computational power required by the modified algorithm, we were unable to train for as many generations with the modified code.

We tried to compensate for this by changing the function by which we evaluate fitness every generation, so that each generation is forced to adapt much more quickly (a sketch of this cycling follows). Still, more time would have been better, because our algorithm would theoretically take longer to arrive at any sort of equilibrium.
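
A minimal sketch of this per-generation cycling follows, reusing the raw_fitness and normalized_fitness sketches above. The fixed metric order, the evaluate_population helper, and the play_games callback are illustrative assumptions, not the framework's actual API.

    NUM_METRICS = 6

    def metric_for_generation(generation: int) -> int:
        """Cycle through the six metrics in a fixed order, one metric per generation."""
        return (generation % NUM_METRICS) + 1

    def evaluate_population(population, generation, play_games):
        """Score every individual with whichever metric is active this generation.
        play_games(individual) is assumed to return a list of GameResult records."""
        metric = metric_for_generation(generation)
        return [
            normalized_fitness(metric, raw_fitness(metric, play_games(individual)))
            for individual in population
        ]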

 

Results

Our first sets of results come from our control experiments, which simply trained the code as written, to give us something to compare against. We then show how our newly formed hypotheses performed in the same situations.

The results are separated out by hypothesis. We chose to test 3 hypotheses from each set of training data, taking one hypothesis from different points in the runs (different generation numbers). There were three trials, each consisting of 50 games. Because Edgar plays the same game against every hypothesis, his scores are all or nothing, so as an additional point of comparison we include the margin, i.e., the difference between the winner's score and the loser's score.

Regular GP vs. Random Player
These are the results from a handful of hypotheses that were derived when the Othello player was trained against a random player, as in the standard code. Notice the poor performance when played against Edgar.
Hypothesis: white
[hypothesis is white player]
Trial White Wins Black Wins
1 40 10
2 43 7
3 42 8
Same hypothesis against Edgar (Reverse colors so hypothesis is now a black player)
Trial White Wins Black Wins
1 50 0
2 50 0
3 50 0
Margin: 34
Hypothesis: ( + white white_corners )
[hypothesis is white player]
Trial White Wins Black Wins
1 43 7
2 41 9
3 45 5
Same hypothesis against Edgar (Reverse colors so hypothesis is now a black player)
Trial White Wins Black Wins
1 50 0
2 50 0
3 50 0
Margin: 20
Hypothesis: ( - ( - ( - ( - ( - ( - ( / 10 black ) black ) black ) black ) black ) black ) black )
[hypothesis is white player]
Trial White Wins Black Wins
1 38 12
2 42 8
3 40 10
Same hypothesis against Edgar (Reverse colors so hypothesis is now a black player)
Trial White Wins Black Wins
1 50 0
2 50 0
3 50 0
Margin: 34
Regular GP vs. Edgar

The first set of results comes from earlier populations around generation 30 or so. The second set of data is from a later generation and the final set is from the last generation generated before we stopped the program.

Hypothesis: ( / black_near_corners ( / ( / black black_edges ) ( - white 10 )))
[hypothesis is a black player]
Trial White Wins Black Wins
1 50 0
2 50 0
3 50 0
Margin: 18 pts
Played against a random player
Trial White Wins Black Wins
1 3 47
2 8 42
3 1 49
Hypothesis: ( + ( * ( - white_corners 10 ) ( + ( - white_corners 10 ) ( / black_edges black_edges ))) black_corners )
[hypothesis is a black player]
Trial White Wins Black Wins
1 0 50
2 0 50
3 0 50
Margin: 32
Played against a random player
Trial White Wins Black Wins
1 1 49
2 2 48
3 5 45
Hypothesis: ( * ( + ( - ( / black_edges black_edges ) 10 ) black_corners ) ( / black_edges black_edges ))
[hypothesis is a black player]
Trial White Wins Black Wins
1 0 50
2 0 50
3 0 50
Margin: 32
Played against a random player
Trial White Wins Black Wins
1 2 48
2 2 48
3 0 50
Modified GP vs. Random Player

The following data are the results from running our hypotheses, trained for adaptability, against a Random Player. The first hypothesis is taken from the end of our first trial, which lasted only 30 generations. The second hypothesis is taken from the second trial, right before the 30th generation. The third hypothesis is from the end of the second trial, which ended after 80 generations.

All hypotheses were also pitted against Edgar to measure their true adaptability.

Hypothesis: ( + ( + ( + ( / white_corners black ) ( * ( - 10 black_edges ) ( + black_corners white ))) ( / 10 black_edges )) ( + ( * ( * 10 ( - ( * white white ) ( * ( * black_edges white_edges ) white ))) ( * ( - white black_edges ) ( - ( + white_edges black_corners ) ( * ( / black_corners white_near_corners ) ( - 10 black_edges ))))) ( / 10 black_edges )))
[hypothesis is a black player]
Trial White Wins Black Wins
1 10 40
2 13 37
3 8 42
Results against Edgar
Trial White Wins Black Wins
1 0 50
2 0 50
3 0 50
Margin: 32
Hypothesis: ( / white ( * black_edges ( - ( / black black_corners ) ( / ( - white_near_corners white_corners ) ( + white_corners 10 )))))
[hypothesis is a black player]
Trial White Wins Black Wins
1 30 20
2 34 16
3 31 19
Results against Edgar
Trial White Wins Black Wins
1 0 50
2 0 50
3 0 50
Margin: 34
Hypothesis: ( / white ( - ( / white ( / ( / white ( * black_edges ( + ( / black_corners ( + black_near_corners black )) ( + white_edges black_near_corners )))) black )) ( + ( - white_near_corners white_corners ) ( - white_corners black ))))
[hypothesis is a black player]
Trial White Wins Black Wins
1 41 9
2 37 13
3 39 11
Results against Edgar
Trial White Wins Black Wins
1 0 50
2 0 50
3 0 50
Margin: 16
Modified GP vs. Edgar
These hypotheses were trained using the modified training method against Edgar specifically. The first hypothesis is from early in the trial (around the 15th generation), the second is from about halfway through (around the 30th), and the third comes from the end of the trial.
Hypothesis: ( + ( * white_edges white_near_corners ) ( - 10 white_corners ))
[hypothesis is a black player]
Trial White Wins Black Wins
1 0 50
2 0 50
3 0 50
Margin: 4
Results Against Random Player
Trial White Wins Black Wins
1 41 9
2 44 6
3 45 5
Hypothesis: ( / white_near_corners black )
[hypothesis is a black player]
Trial White Wins Black Wins
1 0 50
2 0 50
3 0 50
Margin: 6
Results Against Random Player
Trial White Wins Black Wins
1 0 50
2 2 48
3 0 50
Hypothesis: ( * ( + black_near_corners white_edges ) ( * white_near_corners black_near_corners ))
[hypothesis is a black player]
Trial White Wins Black Wins
1 50 0
2 50 0
3 50 0
Margin: 4
Results Against Random Player
Trial White Wins Black Wins
1 16 34
2 13 37
3 16 34
 

Analysis

These tests showed very interesting results overall. We will discuss them in terms of their different training methods.

Training against a Random Player

Training against a Random player with the original code proves successful in beating another random player, though when the hypothesis is reversed to play against Edgar, it fails miserably. This is due to the level of fitness to which the algorithm is held: the random player is relatively easy to beat, and as a result the hypotheses never become very complex. Edgar is more difficult to beat.

Training against a Random player using our modified fitness evaluation mechanism yielded different results. First of all, when pitted against another random player, our hypotheses did not stand up as well, with only one hypothesis able to withstand playing against another random player. Confusingly enough, however, these hypotheses were able to beat Edgar by fairly large margins. The hypothesis that performed the worst against the random player beat Edgar by the smallest margin.

Explaining these results is more difficult; however, we think the poorer performance against the random player can be attributed to the significant difference in training time. The player trained against the random player with the original code was trained the longest in our experiments and thus had full opportunity to develop solely against random players, hence its improved performance against them. However, its training became so specific that it was unable to beat Edgar. Our modified player also trained against a Random player, but due to the many different evaluations of fitness, it produced a more adaptable hypothesis that could readily adapt to play Edgar.

Training Against Edgar Himself

When we trained the original code against Edgar, our results were fairly successful. Although one hypothesis failed against Edgar, this hypothesis was developed first and thus had not evolved as much. All hypotheses performed well against the random player, probably because it is easier to beat.

When we trained our modified code against Edgar, we got similar results, except this time the only hypothesis that failed was the final one. We trained our modified code against Edgar for fewer generations than the original code. Notice that the hypotheses trained to be adaptable beat Edgar by a much smaller margin than the other hypotheses did; even our losing hypothesis lost by a smaller margin. This could be attributed to our focus on adaptability across multiple fitness attributes instead of just one: we cover more ground in terms of a variety of players instead of trying to get the maximum gain over one specific player. When these hypotheses were played against a Random Player, the results were good, although one hypothesis that beat Edgar lost to the random player.

Overall

In the end, our results indicate that training against Edgar with the original code yields the better player overall. In this setup, only one set of trials was lost, and it was early on in the generational development.

The most impressive results come from training against a less skilled player, such as the random player, and then pitting the resulting hypothesis against Edgar. This is where we proved to be the most successful.

Our experimentation could have been improved if we had tested our 'adaptable' hypotheses against many different players to see how each performs. With only two players to test against, it is difficult to show exactly how well something performs across a variety of situations. We believe that testing the same hypotheses against a variety of different players (not just Edgar and the random player) would show their superiority in the end.

 

Conclusion

Our experiment, although it has not yielded incredibly successful results, shows how changing the way in which the fitness of a population is calculated can change the resulting hypothesis. In our experiments training against a skilled player such as Edgar, it proved more useful to train with a single fitness measurement. But in our experiments training against a random player, this modification yielded a more diverse hypothesis that could beat Edgar more easily. Such a process could prove useful in real-life applications where the most difficult problem is not available for training: by training against an easier problem (in this case the random player) and varying the fitness evaluation, one can produce a more powerful hypothesis.