CS W4771: Machine Learning
Assignment #4: Genetic Programming

Alberto Goldberger
albertog@columbia.edu

1 Introduction

For this project, I experimented with the effect of varying complexity and crossover depth constraints during the evolution of function trees to play the game of Othello.

Two thoughts motivated these experiments. The first was the desire to confirm or refute an intuition that, at least for the Othello domain, following Occam's razor -- the KISS principle (Keep It Simple, Stupid) -- would lead to better players (i.e., in general, simpler function trees lead to better solutions). The second was my unfamiliarity with Java: this choice of project allowed me to run experiments without having to do much hacking in an environment I am not familiar with.

The results of the tests I ran did not conclusively confirm or refute my original intuition about simpler function trees. However, they did raise many new questions about the base Othello implementation supplied for this project. These questions are covered in detail in the conclusions section.

2 GP run parameters

A total of 54 runs of the GP algorithm were performed. The following parameters were held constant over each run:

• Set of functions and terminals (same as those in the base package)
• Population size (200)
• Standard fitness (the sum of black pieces left on the board after playing five games as white against a random player)
• Termination condition (either 30 generations or a player which shut out its opponent in all five games -- i.e., standard fitness equal to zero -- whichever happened first)
• Complexity has no effect on fitness (so that when testing the effect of increasing the average complexity of the initial population, we need not worry that the learning algorithm will start to kill off more complex individuals even if they perform well against the random player used for training)
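
The standard fitness measure above can be sketched in a few lines of Python (a hypothetical reconstruction; `play_game`, which returns the number of black pieces left after one game, stands in for the base package's Java game engine):

```python
import random

def standard_fitness(player, play_game, n_games=5):
    # Sum of black pieces remaining after each of n_games played as
    # white against a random opponent; zero means the opponent was
    # shut out in every game (lower is better).
    return sum(play_game(player) for _ in range(n_games))

# Toy stand-in for a real game: leaves 0-3 black pieces at random.
random.seed(0)
fitness = standard_fitness(player=None,
                           play_game=lambda p: random.randint(0, 3))
```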

The following parameters were varied:

• Maximum depth for creation of initial population: By increasing the average complexity of the initial population, it was hoped that the average complexity of subsequent generations would also be increased.
• Maximum depth for crossover: as the initial populations became more complex, I thought allowing the learning algorithm to perform crossovers deeper down the tree would allow a finer granularity of changes to offspring in relation to their parents.
• Maximum complexity (number of internal nodes and leaves): this would potentially allow a comparison between the best players evolved by generally simpler populations versus generally more complex populations, and highlight the effects of placing strict limits on tree complexity versus giving the learning algorithm freedom to develop highly elaborate hypotheses.
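
The interaction between these depth limits is easiest to see in a sketch of depth-limited tree creation. The Python below is an illustrative reconstruction, not the base package's Java code; the function and terminal names are those used by the evolved trees, and the 0.3 terminal probability is an arbitrary assumption:

```python
import random

# Function and terminal sets modeled on those in the base package.
FUNCTIONS = ['+', '-', '*', '/']
TERMINALS = ['white', 'black', 'white_edges', 'black_edges',
             'white_corners', 'black_corners',
             'white_near_corners', 'black_near_corners', '10']

def grow(max_depth, rng=random):
    # Depth-limited "grow" creation: once the depth budget is spent,
    # only terminals may be chosen, so no tree exceeds max_depth.
    if max_depth <= 1 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    return (rng.choice(FUNCTIONS),
            grow(max_depth - 1, rng),
            grow(max_depth - 1, rng))

def depth(tree):
    # A lone terminal counts as depth 1.
    if isinstance(tree, str):
        return 1
    return 1 + max(depth(tree[1]), depth(tree[2]))
```

Raising the creation limit from 6 to 15 or 25 simply gives `grow` a larger depth budget, which is, in spirit, how the more complex initial populations were obtained.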

The specific makeup of the 54 runs is as follows:

• Maximum creation depth = 6, maximum crossover depth = 17, maximum complexities of 25, 50, 70, 80, 90, 100, 125, 150 and 200, each of the nine combinations run twice (18 runs)
• Maximum creation depth = 15, maximum crossover depth = 25, maximum complexities as before, two runs for each combination (18 runs)
• Maximum creation depth = 25, maximum crossover depth = 35, maximum complexities as before, two runs for each combination (18 runs)

For simplicity, specific runs will be identified throughout the remainder of this paper as "creation-crossover-complexity-number." For example, the second run with max creation depth 15, crossover depth 25, max complexity 100 would be identified as 15-25-100-2.

Almost all of the runs evolved individuals whose standard fitness was equal to zero in five generations or fewer. None required more than ten generations to reach the termination fitness. After all runs were complete, each of the 54 best individuals was tested by playing fifty games against a random player, and then those which did very well against the random player were tested against Edgar.

3 Results of GP runs

The following tables show the performance of the best individuals from each run over 50 games against a random player. Columns in bold indicate the best player for each set of runs. Columns in italics indicate the worst. Note that the standard fitness for all of these players as measured by the learning algorithm during evolution is zero, even though their performance over 50 games against a random player varies considerably from player to player.

Runs 6-17-x-1:
| MaxComplexity | 25 | 50 | 70 | 80 | 90 | *100* | **125** | 150 | 200 |
|---------------|----|----|----|----|----|-------|---------|-----|-----|
| Games won     | 36 | 39 | 39 | 33 | 36 | *8*   | **40**  | 17  | 38  |
| Tree Complex. | 3  | 15 | 13 | 63 | 3  | *15*  | **7**   | 3   | 3   |
| Tree Depth    | 2  | 4  | 5  | 6  | 1  | *4*   | **3**   | 2   | 2   |

Runs 6-17-x-2:
| MaxComplexity | 25 | 50 | 70 | *80* | **90** | 100 | 125 | 150 | 200 |
|---------------|----|----|----|------|--------|-----|-----|-----|-----|
| Games won     | 36 | 37 | 34 | *7*  | **43** | 38  | 40  | 37  | 41  |
| Tree Complex. | 11 | 7  | 3  | *9*  | **13** | 3   | 3   | 1   | 3   |
| Tree Depth    | 5  | 4  | 2  | *4*  | **6**  | 2   | 2   | 1   | 2   |

Runs 15-25-x-1:
| MaxComplexity | 25 | **50** | *70* | 80 | 90 | 100 | 125 | 150 | 200 |
|---------------|----|--------|------|----|----|-----|-----|-----|-----|
| Games won     | 32 | **42** | *2*  | 39 | 4  | 38  | 4   | 14  | 39  |
| Tree Complex. | 7  | **3**  | *3*  | 11 | 13 | 3   | i   | 5   | 3   |
| Tree Depth    | 3  | **2**  | *2*  | 5  | 6  | 2   | 4   | 3   | 2   |

Runs 15-25-x-2:
| MaxComplexity | 25 | 50 | *70* | 80 | 90 | 100 | 125 | **150** | 200 |
|---------------|----|----|------|----|----|-----|-----|---------|-----|
| Games won     | 39 | 35 | *6*  | 11 | 40 | 35  | 37  | **44**  | 31  |
| Tree Complex. | 1  | 3  | *3*  | 7  | 3  | 3   | 59  | **1**   | 3   |
| Tree Depth    | 1  | 2  | *2*  | 3  | 2  | 2   | 6   | **1**   | 2   |

Runs 25-35-x-1:
| MaxComplexity | 25 | 50 | 70 | **80** | 90 | 100 | 125 | *150* | **200** |
|---------------|----|----|----|--------|----|-----|-----|-------|---------|
| Games won     | 36 | 40 | 38 | **41** | 36 | 34  | 40  | *2*   | **41**  |
| Tree Complex. | 1  | 5  | 5  | **59** | 3  | 3   | 5   | *5*   | **11**  |
| Tree Depth    | 1  | 3  | 3  | **6**  | 2  | 2   | 3   | *3*   | **5**   |

Runs 25-35-x-2:
| MaxComplexity | 25 | 50 | 70 | *80* | 90 | 100 | 125 | **150** | 200 |
|---------------|----|----|----|------|----|-----|-----|---------|-----|
| Games won     | 40 | 32 | 35 | *0*  | 7  | 35  | 39  | **48**  | 27  |
| Tree Complex. | 3  | 5  | 1  | *5*  | 47 | 7   | 3   | **123** | 5   |
| Tree Depth    | 2  | 3  | 1  | *3*  | 6  | 4   | 2   | **7**   | 3   |

Best overall player: 25-35-150-2 (apologies for the display of this tree; Java would crash, even with the display variable set, when attempting to produce a GIF of each tree). This player, which shut out the random player in almost all of the 48 games it won, lost to Edgar with a score of 18 to 46 (each instance of the word "black" was changed to "white" and vice versa so that this player, trained to play white, could compete against Edgar, which also plays white).

```( +  ( -  ( /  ( -  ( *   10   white_near_corners )
( +  ( +   10   black_edges )
( *   black_edges   white_near_corners )))
( -  ( +  ( /   black_corners   white_corners )
( *   10   white_corners ))
( /  ( -   black   white_near_corners )
( +   black_near_corners   white_edges ))))
( /  ( +  ( +  ( *   black_edges   10 )
( +   black_edges   white_edges ))
( *  ( /   black   white_near_corners )
( +   white_edges   white_edges )))
( *  ( /  ( *   white_edges   10 )
( +   white_near_corners   black ))
( /  ( /   black_near_corners   white_edges )
( *   10   black_near_corners )))))
( -  ( /  ( *  ( +  ( *   black_edges   black_near_corners )
( -   black_corners   10 ))
( -  ( *   white_near_corners   black_corners )
( -   white   white_corners )))
( *  ( *  ( +   10   10 )
( *   black_corners   black_corners ))
( +  ( /   black   black_edges )
( *   white_near_corners   white ))))
( +  ( +  ( +  ( -   white_corners   black )
( *   white_corners   white ))
( +  ( +   white_edges   black_near_corners )
( +   black   white_edges )))
( /  ( *  ( /   white   black_near_corners )
( /   black   black_corners ))
( *  ( /   white_corners   black_near_corners )
( +   black_edges   black_edges ))))))```

Worst overall player: 25-35-80-2

`     ( /  ( *   black_near_corners   white )  white_corners )`
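
Read as s-expressions, these trees are straightforward to evaluate mechanically. The Python sketch below interprets the worst player's tree given a dictionary of board-feature counts; the base package is Java, and the protected-division convention (division by zero yields 1) is an assumption on my part:

```python
def evaluate(tree, features):
    # Terminals are board-feature counts; '10' is a literal constant.
    if isinstance(tree, str):
        return 10 if tree == '10' else features[tree]
    op, left, right = tree
    a, b = evaluate(left, features), evaluate(right, features)
    if op == '+':
        return a + b
    if op == '-':
        return a - b
    if op == '*':
        return a * b
    return a / b if b != 0 else 1  # protected division (assumed)

# The worst overall player, 25-35-80-2, as a nested tuple:
worst = ('/', ('*', 'black_near_corners', 'white'), 'white_corners')
score = evaluate(worst, {'black_near_corners': 2, 'white': 20,
                         'white_corners': 1})   # 40.0
```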

Some other good players:

1. 15-25-50-1

`     ( -   black_corners   black )`

2. 15-25-150-2

`     ( white )`
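
The "Tree Complex." and "Tree Depth" rows in the tables above can be reproduced with two small recursive counts (a Python sketch; trees are written as nested tuples with terminals as strings):

```python
def complexity(tree):
    # Node count: internal nodes plus leaves ("Tree Complex." above).
    if isinstance(tree, str):
        return 1
    return 1 + complexity(tree[1]) + complexity(tree[2])

def depth(tree):
    # Longest root-to-leaf path, counting nodes ("Tree Depth" above).
    if isinstance(tree, str):
        return 1
    return 1 + max(depth(tree[1]), depth(tree[2]))

# Player 15-25-50-1, ( - black_corners black ):
p = ('-', 'black_corners', 'black')   # complexity 3, depth 2
```

These conventions match the tables: the single-terminal player 15-25-150-2 has complexity 1 and depth 1, and the worst player 25-35-80-2 has complexity 5 and depth 3.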

4 Conclusions

Unfortunately, the tables above do not establish any clear pattern which might confirm or refute my intuition that simpler trees would perform better. The best tree (25-35-150-2) is so complex that it is nearly impossible to figure out what sort of "reasoning" about the problem domain it uses to achieve its fitness. Yet other, much simpler trees also performed fairly well (15-25-150-2, for example).

If any conclusion on the original question can be reached, it is that neither the maximum complexity nor the crossover depth has a significant effect on the fitness of the resulting hypotheses. But because the data gathered does not conclusively show either that simpler trees are better or that complex trees are better, the "no significant effect" hypothesis must itself remain unconfirmed. In this respect, I have simply added a new hypothesis to the original hypothesis space, enlarging the problem rather than solving it.

Fortunately, some useful information can be salvaged from these experiments in the form of observations made while analyzing the results of the GP runs. In particular, these observations provide some insight into the effects of the chosen fitness measure and into possible reasons for general tendencies in the evolved function trees. The following two sections discuss these matters.

4.1 Fitness measure

It becomes clear from these experiments that training against a random player to determine standard fitness is not a good idea. The standard fitness of each individual in the population of 200 was measured by playing five games against a random player, with the resulting fitness calculated as the sum of black pieces remaining at the conclusion of each game. All of the 54 "best" players above achieved a standard fitness of zero during their respective GP runs. In other words, they completely shut out their opponent over five consecutive games.

One would expect, then, that a similar performance would have been observed when each of the 54 individuals was tested over 50 games against a random player. But not one of the players managed to shut out the opponent in all 50 games. Some, in fact, did very poorly, winning fewer than ten of the fifty games played against the random player. Either the opponent randomly played very poorly while standard fitness was being evaluated, or it randomly played very well during the subsequent testing over 50 games.

In either case, a random player is an unreliable predictor of fitness. From another viewpoint, it's like trying to learn to draw a circle by watching someone draw squiggles. One can't learn to rationally tackle a problem through the example of someone whose decisions are based on chance. The fact that the best overall function tree shut out the random player in almost all of the games it played, but then lost by a large margin to Edgar, illustrates this point quite well.

On the other hand, when time is an issue, training against anything but a random player can be difficult, since the evaluation time, which is of constant order for the random player, begins to increase with increasing sophistication of the opponent against which fitness is measured.

Perhaps the optimal approach would be to begin training against a random player in earlier generations, and then increase the sophistication of the opponent used to measure fitness (either by having individuals in the population compete with the best ones or by using an established player of the desired skill level) as the average fitness of the population begins to rise.
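
This staged approach could be as simple as switching the fitness opponent once average standard fitness drops below a threshold (a sketch; the threshold value and the opponent names are purely illustrative):

```python
def choose_opponent(avg_fitness, random_opponent, strong_opponent,
                    threshold=10):
    # Lower standard fitness is better here (zero means shutouts),
    # so train against the random player while the population is
    # weak and graduate to a stronger opponent as fitness improves.
    if avg_fitness > threshold:
        return random_opponent
    return strong_opponent
```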

4.2 General tendencies

One of the most salient details observed after the runs were completed and the best individuals were tested over 50 games against the random player was that the great majority of these individuals would frequently shut out the random player in each game they won. There were some exceptions, of course (two or three of the players came close to ties, on either the winning or the losing side, in each of the 50 games). This tendency appears to hold both for the more complex and the simpler trees, and both for trees which won an overwhelming majority of their games against the random player and for those which won a less pronounced majority.

The most obvious explanation I can offer can once again be attributed to the fact that training was done against a random, "irrational" player. Because the learning algorithm did not play against an opponent whose moves followed any sort of logical pattern, the best approach seemed to be a greedy one: prefer moves which maximize the number of pieces converted to white, since it is difficult to predict how the opponent will respond to any move. The result was that many of the evolved trees contained as terminals either just "white" or "white" added to or multiplied by the terminal 10. Other trees did the opposite, subtracting from 10 the number of black pieces remaining after a move by white to achieve a similar effect.
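
Reduced to code, the greedy strategy many trees converged on amounts to scoring each candidate move by the white count alone (an illustrative Python sketch; `candidate_boards`, a list of piece-count dictionaries, one per legal move, is a hypothetical stand-in for the real board representation):

```python
def greedy_score(board_after_move):
    # Equivalent to an evolved tree whose only terminal is "white".
    return board_after_move['white']

def best_move(candidate_boards):
    # Pick the move that converts the most pieces to white.
    return max(range(len(candidate_boards)),
               key=lambda i: greedy_score(candidate_boards[i]))
```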

This leads to the next observed tendency: regardless of the constraints placed on maximum complexity, the best tree generated by each run was much more likely to have a very simple structure (depth less than or equal to three) than a more complicated one. In fact, although all runs allowed complexities of at least 25 tree nodes, only 5 of the 54 best trees had complexities greater than 15. Of note is the absence of trees with complexities between 15 and 45 (which were allowed by all but six of the test runs); all of the evolved trees fell either below or above this range. Perhaps the simpler trees (generally implementations of greedy algorithms) survived well, following the reasoning above, while the most complex trees encoded enough information to develop some sort of game strategy. Those in between, then, may have died out because they were neither completely greedy nor sophisticated enough to implement a reasonable strategy. While there is no evidence to substantiate this theory, and it is difficult to understand the strategy of large trees precisely because they are so complex, it is an interesting question to consider in further detail.

albertog@columbia.edu - December 1997