For example, I had noticed the benefit in an experiment that "learned" algebraic functions. Many of the successful individuals did match the output of the target function, but were much more complex. Closer examination revealed that the successful individuals contained branches that evaluated to zero (for example, a branch that equaled x*x*x*(x-x): x-x = 0, which caused the entire branch to equal zero, irregardless of the input). When these individuals were reduced to take these zero branches into account, the equaled the target function.
On the other hand, my early experiments with the Othello package implied something completely different. Many of my early runs quickly and consistently produced one "best" individual of complexity one. Apparently, the complexity was affecting the evaluation of the population to produce an individual that was probably not the best individual in that generation, and the population tended to skew toward less complex individuals.
My solution to this problem was to attempt to scale down the complexity's contribution to overall fitness. I would attempt to find a good scale to reduce the contribution of complexity from a major indicator to a way of differentiating individuals of different lengths in a manner that would favor the smaller (meaning reduced, or all branches can factor in the game strategy) individuals over larger individuals that performed just as well otherwise.
I was provided with one previously created and tested player. I tested this against 500 random players to map out a few benchmarks. In this test, the player won 31.2% of its games, with varying performance. Some of the games were very decisive wins or losses, while others where fairly close. I also tested the current package out to see what other players it would produce. One individual appeared often, and seemed to perform well. It was a single node, returning the value of the color that the player is playing, effectively attempting to maximize the number of pieces of that color at every move. This player, playing against 500 random players, won 46.4% of its games, again with widely varying results.
Although I thought that alot of the current results were due to the complexity factor, I also wanted to try to vary the fitness measure of the individual to better represent what had needed to be learned. I also wanted to attempt to incorporate some of the Edgar's skill into my players. The existing fitness measure at the end of the game was simply based on the opponent's score at the end of the game (in this case, lower fitness measures mapped to "fitter" individuals, while higher fitness measures indicated poor performance). Instead of basing my results on the end of the game, I decided to try an evaluation that found the difference between Edgar's ranking of each the players' moves and the ranking of what Edgar thought was the best move, averaged over the course of the game.
I didn't want complexity to play a major factor in the fitness of an individual. I was hoping to reduce the factor that fitness played enough that it would not effect the comparitive fitness of individuals that performed differently in play. I did not want a player that was not as good at Othello but much less complex to be rated as "fitter" than an individual that performed better in the game; I wanted complexity to act as a tiebreaker of sorts between two (possible functionally equivalent) individuals that performed the same while playing, but had differing levels of complexity. I hoped to accomplish this by reducing the amount that complexity added to fitness to the approximate order of differing fitness.
The final factor of complexity/50 seemed to suit this well. Without computing the mean of fitness and complexity over the entire test, I would say that most of the values of complexity ranged somewhere between 10 and 25, while most of the fitness measure ran between 100 and 170. Since these complexities were divided by 50, most of the fitness measures were increased by between .2 and .5, or roughly on the order of 10^-3. With my new game performance measure in place, I ran two tests -- one using this reduced complexity as a factor in fitness, and one that ignored complexity altogether -- to 100 generations with a population size of 200. These tests were actually quite successful -- a striking difference in the average complexity , the average fitness, and the fitness of the population over time resulted. In the higher generations, the test with complexity grew to more examples of a few number of individuals (three were very common, lead by an individual with a complexity of three), while the average fitness of the population began to stagnate as the variety dipped lower. Meanwhile, the test without complexity grew a population of very large, complex individuals, but the variety actually increased to 100% (a population consisting entirely of unique individuals), while the average fitness of the population continued to get better, finishing substantially lower than the fitness of the population that used complexity as part of fitness.
I tested this concurrently with the complexity changes in the test that I have outlined in the previous section. Once I had evolved the best players, I chose several that had the best fitness for performance testing against random players. I chose the six top players from the test that did not use complexity as part of the fitness measure, and eight players from the test that did use complexity as part of the fitness measure. (I had originally intended to test five players, but settled on six and eight because last few players were very close in terms of fitness.) I tested each set of players against a random player for fifty games, and if there was one that was clearly performing worse than the others (in terms of winning percentage), I eliminated it from the set. I eventually tested the best players for 500 games. The results were rather discouraging.
The players that were evaluated without complexity did the best, adding some weight to my argument about complexity. However, they did not perform that well overall. The two best individuals from the test that did not use complexity lost more than they won against random players (36.6% and 36.8%). While this was better than the existing complex player (31.2%), this was still much worse than the player that just attempted to maximize its pieces at every move (46.4%). While this did represent an improvement over the current method, it was still very poor. A very simple heuristic was outperforming it significantly, and simply picking a random move was winning almost two out of three games against it.
The players that were evaluated with complexity as a factor in fitness were much worse. The best player from that test won 32.6% of its games, but was taken from the second of 100 generations--long before the population really had time to evolve. The next best player managed to win 25.6% of its games against random players, the third best player won 23.6% of its games, while the last five of the best players all had winning percentages in the teens. (However, the test using complexity was used more for determining the role of complexity in evolving genetic players than to evaluate the fitness measure.)
I examined the performance of the best players. What I discovered by looking over their results and their games was encouraging. Although they lost more games than they won, they usually had fairly close losses, and very decisive victories. While some of this can be attributed to the random players, I felt that this indicated that they had learned something. Although they had not learned all of Edgar's skill, they had acquired some of Edgar's ability to take advantage of opponent's mistakes. They simply needed to learn to win. This was understandable, since the current evaluation did not take win or loss into account, but merely how well Edgar thought that they had played.
I devised a modified fitness measure, based on what I had observed. I decided not to use complexity as part of the fitness, since the previous test provided me with strong evidence against this increasing performance. I also increased the number of games that each individual played from five to ten in an attempt to decrease the amount of noise from the random players (Toward the end of the test that used complexity, I found that two or three players made up the entire population, and accounted for a wide range of fitness--most likely due to one getting a series of bad random players, and another getting a series of good random players. I hoped that by increasing the number of games, each individual would face a better variety of random players, and the chances of an individual being evaluated as better or worse than it actually was would decrease.) Finally, I modified the fitness measure itself. The new fitness would be the previous fitness (the average of the difference between Edgar's evaluation of the player's move and Edgar's best move, multiplied by ten) multiplied by the ratio of the opponent's score over the player's score at the end of the game. (This was quickly changed to the opponent's score plus one divided by the player's score plus one, in order to avoid zeros in the numerator or denominator.) The theory behind this approach was to continue taking Edgar's evaluation into account, while adding an incentive to win. Edgar's evaluation would be scaled up to a significantly larger number for a loss against the random player, and would be scaled down significantly for a victory against the random player. Furthermore, the player would be penalized very severely for losing badly, and rewarded greatly for a decisive victory.
This method seems to have worked the best from my tests against the random players. Out of the ten players that I tested (there wasn't much difference in the fitness measures of the best players), four of the players score above 40% (46%, 44%, 43.2%, and 41.4%), the best of which came very close to the strategy that maximized the number of pieces at every move (46.4%, or a difference of 0.4%). This was also significantly better than the best individuals produced with the other fitness measure. While it was not able to win more than it lost against the random player, and equalled the winning percentage of the simply heuristic player, it was a significant improvement over the existing strategies.
The test of the different evaluation functions was not quite as successful. My target was to create a player that could compete on Edgar's level, but I was only able to produce players that could compete with a very simple heuristic, and could almost hold there own against a player that randomly chose legal moves. However, I do feel that it was successful for several reason. The rising winning percentage over my test cases, compared to the "best" player that I had been given to start with, indicates that the fitness measure had been improved, that something had actually been learned. Furthermore, the final test (using the fitness measure that multiplied the average evaluation by the ratio of the score) did beat Edgar decisively in one respect. When I tested Edgar in 500 games against the random player, it had a winning percentage of 96%. However, it never defeated a random player before the board was filled. Meanwhile, all four of the best players in the last test had multiple games that ended before the board was filled (winning before filling the game board between 3.6% and 5.6% of the time, with two players each winning one game by the score of 16-0). While none of the players were able to consistently defeat the random player, they often had very strong wins, indicating that progress has been made.