CS W4771 Machine Learning, Homework 4

Competitive Fitness using a Single-Elimination Tournament

Yi-Min Chee
11/11/97

Introduction

One of the difficulties in applying genetic algorithms to game-playing tasks is achieving results which generalize. Typically, the output generated by the GA must be able to play well against a wide variety of opponents (employing different strategies) in order to be considered successful. This usually requires the population to be exposed to a large number of strategies during the course of evolution. Such breadth of exposure can be provided by using exemplars for various strategies which are then employed to evaluate the fitness of individuals in the population. However, this approach has the drawback of requiring such exemplars to be identified. This can be a difficult task, especially given the requirement for breadth and the fact that in many cases each exemplar must be individually hand-crafted. One solution to this problem is to use a competitive fitness measure that pits individuals in the population against one another in order to determine their relative fitness. This both eliminates the need to identify exemplar strategies and allows the opponents to evolve over time as the population improves.

There are several ways of pitting individuals against one another to determine relative fitness. One possibility is to have each individual play a fixed number of games against a random sampling of the population, with fitness derived from the individual's performance over those games. An alternative is to play a single-elimination tournament among the individuals in the entire population. The latter method has the advantage of requiring fewer games, thereby reducing training times, but potentially incorporates more noise into the learning process (in each generation, half of the population is judged solely on the result of a single game).
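
The game-count tradeoff between the two schemes can be made concrete with a small sketch (the function names and the parameter values shown are illustrative, not taken from the assignment code):

```python
def games_fixed_sampling(pop_size: int, games_per_individual: int) -> int:
    """Every individual plays a fixed number of games against sampled opponents."""
    return pop_size * games_per_individual

def games_single_elimination(pop_size: int) -> int:
    """A single-elimination tournament over N players takes exactly N - 1 games."""
    return pop_size - 1

# For a population of 128 with five games per individual:
print(games_fixed_sampling(128, 5))    # 640 games per generation
print(games_single_elimination(128))   # 127 games per generation
```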

This assignment examines the use of a single-elimination tournament as a competitive fitness measure in a GA system for generating Othello players. The tournament fitness measure is evaluated both in terms of its effects on learning time and on the generalization success of the resulting individuals. The experiments also address the problem of measuring progress when a competitive fitness measure is employed.

Approach

For this assignment, two baseline testcases were used. Both cases consisted of learning using the standard GA with no competitive fitness. The fitness measure used for the first baseline case was performance in five games played against the random player (a player which ranks potential moves using a random score). The GA was executed for 10 good runs, with a maximum of 51 generations before giving up. The population size for the baseline was 128. The termination fitness was set to 0.1, which meant that the resulting player had to beat the random player in all five games without allowing the random player to have any pieces at the end of any game. For all of the cases, complexity was not factored into the fitness score (the complete GA parameters for all experiments are provided here).

The other baseline case used the same setup as the first, but with fitness measured by playing against Edgar rather than the random player. Since both Edgar and the individuals in the population play the game deterministically, it was not possible to play a series of games to evaluate each individual (an attempt was made to randomize the GA players by selecting a move at random when there were several choices with the same value, but this resulted in degraded performance of the GA players against even the random player!). The population size was halved to 64 due to time constraints. The termination fitness for this experiment was set to 20.0, which meant that the resulting player was required to beat Edgar by a reasonable margin.

The first two competitive fitness experiments replaced the fixed-set evaluation of the baseline testcases with a single-elimination tournament, which was used to derive the fitness of each individual in the population. In each round of the tournament, if an individual had lost in the previous round, the individual was "awarded" a score of 64 (the worst possible) in order to indicate that it was out of the tournament. This produced the intended effect of favoring those individuals who advanced farther in the tournament. It also meant that all of the losers in a particular round were penalized equally, with their fitnesses differentiated by their performance in earlier rounds. In the experiments conducted, the population size was set to a power of two to avoid byes, but the tournament-playing code was written to handle byes by awarding the receiver of the bye a score equal to its average score in the previous rounds of the tournament (so that the bye neither directly helped nor directly harmed the player's score).
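
The round structure described above might be sketched as follows. Here `play(a, b)` is a hypothetical game function returning the winner and both players' scores, where (as in the Othello system) a score is the opponent's final piece count, 0 being best and 64 worst; the bookkeeping details are an assumption, not the assignment's actual implementation:

```python
import random

def tournament_fitness(players, play):
    """Single-elimination tournament scoring sketch.

    Returns a dict mapping each player to its list of per-round scores.
    Eliminated players receive the worst possible score (64) in each
    subsequent round; a bye is scored as the receiver's average over its
    previous rounds (the experiments used power-of-two populations, so
    byes never actually arose there).
    """
    scores = {p: [] for p in players}
    alive = list(players)
    eliminated = []
    while len(alive) > 1:
        random.shuffle(alive)          # random pairings each round
        for p in eliminated:           # knocked-out players get 64
            scores[p].append(64)
        next_round = []
        if len(alive) % 2 == 1:        # odd field: one player gets a bye
            bye = alive.pop()
            prior = scores[bye]
            scores[bye].append(sum(prior) / len(prior) if prior else 0)
            next_round.append(bye)
        for a, b in zip(alive[::2], alive[1::2]):
            winner, w_score, l_score = play(a, b)
            loser = b if winner is a else a
            scores[winner].append(w_score)
            scores[loser].append(l_score)
            next_round.append(winner)
            eliminated.append(loser)
        alive = next_round
    return scores
```

Note that a population of 2^k players yields exactly k rounds, so every individual ends up with the same number of per-round scores, with fitness differentiated by how early the 64s begin.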

In the first competitive fitness experiment, the progress of the GA was tracked by measuring only the tournament winner against the random player. The experiment was run with identical parameters to the first baseline. Again, the resulting player was required to beat the random player in five consecutive games without allowing the random player any pieces at the end of any of the five games.

A similar experiment was carried out using performance against Edgar as the measure for determining termination fitness. This experiment used the same parameters as the second baseline.

These first two competitive fitness experiments used only the single-elimination tournament to guide evolutionary progress. This meant that there was no direct evolutionary pressure on the individuals to improve with respect to the measure used to determine success (performance against either the random player or Edgar). As a result, the population might fail to make measurable progress towards a successful outcome. A final experiment combined the single-elimination tournament with fitness measured against Edgar, in an attempt to see if applying such evolutionary pressure would help the learning process. For this experiment, each individual's tournament result was weighted evenly with its score in a game played against Edgar and the sum of the two was used as the measure of fitness. The experiment was conducted using the same parameters as the second baseline, but with the termination fitness for the experiment set at a level that would require both good performance against Edgar and good performance in the tournament.
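
A minimal sketch of the hybrid measure, under the assumption that an individual's tournament result is summarized by its average per-round score (the text specifies only that the two components are weighted evenly and summed):

```python
def hybrid_fitness(round_scores, edgar_score):
    """Evenly weighted hybrid fitness: tournament result plus the score from
    a single game against Edgar. Scores are opponent piece counts (0 best,
    64 worst), so lower fitness is better. Averaging the round scores is an
    assumption for this sketch, not the assignment's stated formula.
    """
    tournament_part = sum(round_scores) / len(round_scores)
    return tournament_part + edgar_score

print(hybrid_fitness([10, 20], 15))  # 30.0
```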

The test runs for the baseline testcases were compared with the competitive fitness test runs to measure the effect of competitive fitness on both the learning rate and the time required. Because the execution time varied due to the presence of other processes, this comparison was based upon the number of generations and games played.

Finally, the resulting players from the baseline testcases were evaluated against the players generated using competitive fitness measures in order to determine the generalization success of players.

(additional details of the modifications to the original Othello GA system are provided here)

Results

First Baseline Comparison

The resulting players for the first baseline testcase are depicted here (with duplicates removed). Although the termination fitness appeared to be fairly selective, in practice this bar was not very difficult to reach. On average, the GA required just 2 generations to arrive at a solution that exceeded the termination fitness. The resulting players for the first competitive fitness experiment are depicted here. In this case, the GA required an average of 7.2 generations to arrive at the desired fitness level.

In terms of the learning time, while the baseline case required fewer generations, the number of games played in each generation was 640 (128 individuals each playing five games), while the competitive fitness experiment played only 132 games per generation (127 for the single-elimination tournament plus five games for the tournament winner against the random player). Thus the average number of games played per run was only 950.4 for the competitive case vs. 1280 for the baseline. These results are summarized below.


    Testcase               Games/      Avg. # of      Avg. # of
                         Generation   Generations   Games Played
    ------------------------------------------------------------
    Baseline                640           2.0          1280.0
    Tournament              132           7.2           950.4

In order to measure generalization success, the resulting players from the baseline were pitted against the resulting players from the competitive fitness runs. Two games were played between each pair of players, with each side playing both as black and white. In addition, the players were also evaluated against the random player and against Edgar. The results are shown below.
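
The head-to-head evaluation can be sketched as follows, with a hypothetical `play(black, white)` returning the winning player (or None on a tie); each pair plays twice, once with each side as black:

```python
from itertools import product

def cross_evaluate(set_a, set_b, play):
    """Pit every player in set_a against every player in set_b.

    Each pair plays two games with colors swapped; returns the total
    number of wins for each set.
    """
    wins_a = wins_b = 0
    for a, b in product(set_a, set_b):
        for black, white in ((a, b), (b, a)):
            winner = play(black, white)
            if winner is a:
                wins_a += 1
            elif winner is b:
                wins_b += 1
    return wins_a, wins_b
```

With 6 baseline players and 10 tournament players this yields 6 x 10 x 2 = 120 games, matching the totals in the table below.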


                                        Games won vs.
    Testcase (# players)         Other     Random (%)  Edgar (%)
    ------------------------------------------------------------
    Baseline (6 players)       47 (39.17%)    100%        0%
    Tournament (10 players)    73 (60.83%)    100%        0%

Both sets of players fared well against the random player. This was expected, since performance against the random player was used precisely as the measure of fitness in the baseline case and also as the test for the termination condition in the tournament case. Surprisingly, neither set of players performed well against Edgar, with both sets losing all games. More relevantly, in head-to-head competition, the players generated using competitive fitness won 60.83% of the games. This supports the notion that competitive fitness exposes the population to a wider range of strategies, thus resulting in more generalizable solutions.

Second Baseline Comparison

For the second baseline, the GA yielded the players depicted here. On average, 6 generations were required over the runs of the GA to arrive at a solution that met the termination criteria. With a population size of 64 individuals, this meant an average of just 384 games played per run (since each individual played just one game against Edgar).

For the second competitive fitness experiment (where only tournament result was used to determine fitness), the GA generated the players shown here. The GA required an average of 45.4 generations to reach a successful solution, taking into account two bad runs which were terminated in failure after the maximum number of generations had been reached. With 64 games played in each generation (63 for the tournament over a population size of 64 individuals plus one game to evaluate the best individual against Edgar), this meant on average 2905.6 games played per successful run, many more than the number required for the baseline.

The final competitive fitness experiment resulted in the players depicted here. On average, 20.7 generations were required to reach a successful solution, accounting for one bad run which failed to achieve a good solution in the maximum number of generations. Since each generation required a total of 127 games (63 for the tournament and 64 to evaluate each individual against Edgar), the average number of games played per good run was 2628.9. This is still many more than the number of games required in the baseline case, but fewer than the number required when only tournament result is used to determine fitness. In addition, the average number of generations required by this hybrid fitness case was less than half of the number required in the pure tournament case. This lends support to the idea that incorporating direct evolutionary pressure reduces the number of generations required to arrive at a solution.

The preceding results are illustrated below. The line labeled Tournament refers to the second competitive fitness experiment (using only tournament result to evaluate fitness), while the Hybrid line refers to the final experiment (employing both tournament result and performance against Edgar as the fitness measure).


    Testcase               Games/      Avg. # of      Avg. # of
                         Generation   Generations   Games Played
    ------------------------------------------------------------
    Baseline                 64           6.0           384.0
    Tournament               64          45.4          2905.6
    Hybrid                  127          20.7          2628.9

As in the previous case, the resulting players from the second baseline and the final two competitive fitness runs were pitted against one another for the purpose of determining generalization success. Again, each pair of individuals played two games with each side playing as both black and white. The players were also evaluated against the random player and against Edgar. The results are shown below.


                 Baseline vs. Pure Tournament

                                        Games won vs.
    Testcase (# players)         Other     Random (%)  Edgar (%)
    ------------------------------------------------------------
    Baseline (10 players)       21 (21%)      60%       100%
    Tournament (5 players)      79 (79%)      60%        80%


                  Baseline vs. Hybrid Fitness

                                        Games won vs.
    Testcase (# players)         Other     Random (%)  Edgar (%)
    ------------------------------------------------------------
    Baseline (10 players)       64 (32%)      60%       100%
    Hybrid (10 players)        136 (68%)      80%        80%


               Pure Tournament vs. Hybrid Fitness

                                        Games won vs.
    Testcase (# players)         Other     Random (%)  Edgar (%)
    ------------------------------------------------------------
    Tournament (5 players)      52 (52%)      60%        80%
    Hybrid (10 players)         48 (48%)      80%        80%

All three sets of players performed well against Edgar. In the pure tournament case, the single loss to Edgar occurred because the individual with the best fitness in the population was not the tournament winner (this could have occurred if the tournament winner narrowly beat its opponent in the last round of the tournament, with that opponent having a better score average for the previous rounds). Although the implementation measured the tournament winner against Edgar and determined that the termination fitness had been reached, the individual with the best fitness was selected by the GA as the resulting player. In the hybrid case, of the two games that were not victories against Edgar, one was a tie and the other was a narrow loss. In both cases, the combined fitness of the individual was still lower than the termination fitness, indicating that the player performed very well in the tournament half of the fitness measurement.

As expected, in head-to-head competition against the baseline testcase, both the pure tournament and hybrid competitive fitness players outperformed the players generated without using competitive fitness. In the pure tournament case, the resulting players won 79% of the time, while the hybrid players claimed 68% of the games against the baseline players. When measured against each other, the players generated using pure tournament and hybrid cases were nearly evenly matched, with the slight edge going to the pure tournament players. On the other hand, the hybrid players performed slightly better against the random player.

Conclusion

A single-elimination tournament employed as a competitive fitness measure can be more effective at producing generalizable solutions than a fitness measure based on a small number of exemplars (over multiple games, the random player approximates some set of exemplars in the experiments described above). When using such a competitive measure, care must be taken to ensure that the learning time for the GA does not become excessive due to the lack of evolutionary pressure towards achieving a solution. The hybrid solution discussed here illustrates one approach that can provide such evolutionary pressure by combining the tournament result with a separate, fixed measure of fitness.