Final Project: PuyoPuyo
Juno Suk, jsuk@lucent.com

Full game instructions
Play the game


Introduction
What is PuyoPuyo? PuyoPuyo is an addictive game in the same falling-blocks genre as Tetris. You are given pieces, each consisting of two colored balls. The goal is to shift, rotate, and drop these pieces into a 6-column, 14-row pit in such a way that balls of the same color group together. Once four or more same-colored balls are grouped together horizontally, vertically, or in a combination of the two, they disappear and any balls above them drop straight down to fill their place. As in Tetris, the object is to keep the pit from filling to the top.


A simple game so far, but higher scores require a little more intelligence. More points are awarded for groups of balls that disappear in a chain-reaction sequence. For example, when a group of like-colored balls disappears, the balls above drop down to fill their place. If, upon dropping, these balls find themselves in a group of four or more like-colored balls, then they too disappear, and bonus points are awarded. If, after these balls disappear, yet another group is formed upon dropping, it too disappears and gives even more bonus points. In this way the game becomes more exciting and requires a bit of intelligent foresight.

When you add head-to-head, two-player competition to the game, it becomes an even greater challenge, since bonus points translate into sending obstacles (rocks) into your opponent's pit. For full instructions on this game, please click here. This game is best learned by playing it. Click here to play it.

The objective of this project is to genetically program AI players who are capable of playing this game at a level comparable to experienced human players. The Java programming language is used to implement every aspect of this game, including the use of the gpjpp package for generating the GP populations.


Approach

Fitness
Breaking the problem down, it was determined that there are two goals to focus on when creating the PuyoPuyo AI player, and these goals are closely tied to how the fitness of the players is measured. The first goal is to create a player capable of merely surviving for an extended period of time; that is, a player that knows better than to stack pieces haphazardly up to the top. The more difficult second goal is to create a player capable of setting up the pieces in positions conducive to chain reactions, to earn those bonus points (or to send rocks over to the opponent in head-to-head play).

To reflect these two goals, fitness was measured in two ways.
For every piece that is dropped into the pit, the survival fitness (s-fitness) is incremented by 1 point. That way, the longer the player survives, the higher the s-fitness, and therefore the higher the chance that this player's GP function knows how to keep it alive.

The second fitness measure, the position fitness (p-fitness), is measured directly by how many rocks would have been sent over by a group or groups disappearing, had the player been playing against an opponent. This translates to:

1 group              1 rock
1 group + n balls    1 + n rocks
2 groups             1 row (6 rocks)
2 groups + n balls   1 row + n balls (6 + n rocks)
3 groups             2 rows (12 rocks)
4 groups             3 rows (18 rocks)
5 groups             4 rows (24 rocks)
6 groups             5 rows (30 rocks)

So, if the player were to make one group of 4 and one group of 6 disappear, it would send over 8 rocks, since this is a chain sequence of 2 groups plus an extra 2 balls (6 + 2 rocks). Its p-fitness would therefore be incremented by 8 points.
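
In code, the pattern in the table reduces to a small rule. The sketch below is my own reading of that pattern in Java, not code taken from the project; groups is the number of like-colored groups cleared in one chain sequence, and extraBalls counts the balls beyond four in each of those groups.

class RockCount {
    // Rock count implied by the table above (my reading of the pattern,
    // not the project's actual rule). "groups" is the number of like-colored
    // groups cleared in one chain sequence; "extraBalls" is the total number
    // of balls beyond four across those groups.
    static int rocksSent(int groups, int extraBalls) {
        if (groups <= 0) return 0;
        if (groups == 1) return 1 + extraBalls;      // 1 group + n balls -> 1 + n rocks
        return (groups - 1) * 6 + extraBalls;        // each group past the first adds a row of 6
    }
    // Example from the text: one group of 4 and one group of 6
    // -> 2 groups, 2 extra balls -> rocksSent(2, 2) == 8.
}
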
The question then arose as to how the two measures would work together in creating the AI player; there needed to be some sort of balance between the two fitnesses. Horn et al. (1994) note, "Historically, multiple objectives have been combined ad hoc to form a scalar objective function, usually through a linear combination (weighted sum) of the multiple attributes..."[1] Though Horn et al. go beyond this linear combination in their work, the linear-combination approach suffices for this project. Finding the Pareto-optimal front in this project was relatively simple, and, as it turned out later, actually unnecessary. Even so, it was decided that populations would be created at five different points along the front, represented by the following fitness equations:

p05:   s-fitness * 0.05 + p-fitness * 0.95
p25:   s-fitness * 0.25 + p-fitness * 0.75
p50:   s-fitness * 0.50 + p-fitness * 0.50
p75:   s-fitness * 0.75 + p-fitness * 0.25
p95:   s-fitness * 0.95 + p-fitness * 0.05

The extreme edges of this front use weights of 5% and 95%. At first I tried 0% and 100%, but this caused too many ties in best fitness, especially when s-fitness was weighted 100%, since many of the AI players were able to survive indefinitely and would max out the s-fitness score at 144.
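
For illustration, the five equations are just one weighted sum evaluated at five weights. The helper below is a minimal sketch of mine (not project code), where sWeight is 0.05, 0.25, 0.50, 0.75, or 0.95 for the p05 through p95 populations respectively.

class ParetoFitness {
    // Weighted-sum fitness for a single point on the front.
    static double combinedFitness(double sFitness, double pFitness, double sWeight) {
        return sFitness * sWeight + pFitness * (1.0 - sWeight);
    }
}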

Primitives
When creating the primitives to implement the genetic programming for the AI player, a decision had to be made: retain complete domain independence and follow the standard model of GP, or try to enhance the playing algorithm by hand, which might produce better results, albeit with some human intervention. This dilemma is addressed by Janikow (1993): "On one side of the spectrum lies a method that only uses the classical operators of mutation and crossover. This approach is conceptually easy... On the other side of the spectrum lies a knowledge-intensive method that completely abandons the traditional domain-independent operators, and, instead, fully implements the specific problem-solving methodology." [2]

It was decided to give the GP full rein over how the AI players would play; I would only provide the primitives for it to work with:

Primitive    Args   Factor      Description
---------    ----   ------      -----------
+            2      N/A         Sums the two arguments.
-            2      N/A         Subtracts the two arguments.
*            2      N/A         Multiplies the two arguments.
/            2      N/A         Safe-divides the two arguments.
height       0      1.5         A measure of how tall a column is. This
                                terminal is special in that it does not return
                                the raw height but maps it through a steeply
                                falling (exponential-style) function, so that
                                tall columns give extremely low values and
                                pieces are discouraged from being placed there,
                                for survival's sake. The mapping is:
                                  height  0 -> 10     height  8 -> 7
                                  height  1 -> 9      height  9 -> 7
                                  height  2 -> 9      height 10 -> 6
                                  height  3 -> 9      height 11 -> 5
                                  height  4 -> 9      height 12 -> 4
                                  height  5 -> 8      height 13 -> 2
                                  height  6 -> 8      height 14 -> 0
                                  height  7 -> 8
sides        0      3 x 2.0     Counts the same-colored balls to a ball's
                                left, right, and bottom. This terminal, above
                                all, is probably the most important, since it
                                is responsible for causing same-colored balls
                                to group together.
diagonals    0      5 x 1.2     Counts the same-colored balls to the
                                upper-left, upper-right, lower-left, and
                                lower-right, plus the ball two spaces below.
                                This terminal should facilitate chain
                                reactions by causing same-colored groups to
                                form when its own column, or a neighboring
                                column holding a same-colored ball it is
                                diagonal to, shifts down one.
diagonals2   0      4 x 1.5     Counts the same-colored balls two rows up and
                                to the left and right, and two rows down and
                                to the left and right. Same idea as the
                                diagonals terminal, except that the chain
                                reaction comes when its column or a
                                neighboring column shifts down two.
column       0      9 x 0.67    Counts all the same-colored balls in the same
                                column.
adjcolumn    0      18 x 0.33   Counts all the same-colored balls in the
                                neighboring column(s).
4            0      1           The constant 4.

The Factor column shows the highest possible value of the primitive along with the multiplier used to normalize its output, so that no single primitive dominates and causes a "can't see the forest for the trees" problem. As the table shows, every factor multiplies its primitive's maximum value to a product of roughly 6.0, with the exception of the constant 4 and the terminal height.
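
As an illustration, the height terminal can be implemented as a straight lookup table taken from the mapping above. The sketch below is mine; in particular, the assumption that the 1.5 factor is applied as a plain multiplier, as with the other terminals, is not stated in the original table.

class HeightTerminal {
    static final double FACTOR = 1.5;
    // index = column height 0..14, values copied from the mapping above
    static final int[] VALUE = { 10, 9, 9, 9, 9, 8, 8, 8, 7, 7, 6, 5, 4, 2, 0 };

    // Assumed application of the factor: a plain multiplier on the mapped value.
    static double value(int columnHeight) {
        return VALUE[columnHeight] * FACTOR;
    }
}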

Evaluation
So how are the moves evaluated in the game of Puyo?

Given any piece, there are 4 different rotation positions, two vertical and two horizontal. In a horizontal position, the piece has five possible column positions; in a vertical position, it has six. So, all together, there is a total of 5 + 5 + 6 + 6 = 22 moves available for any given piece. Before running through all the moves, to avoid repeating calculations, the evaluation function calculates the values of sides, diagonals, and diagonals2 for each column four times: for each of the two ball colors in the current piece it calculates the primitive twice, once with the ball directly above the highest ball in the column, and once with the ball two spaces above the highest (for the vertical piece orientations). So a maximum of 3 primitives * (6 columns * 2 colors * 2 positions) = 72 values is obtained for these three primitives. For the height, column, and adjcolumn primitives, it finds two values for each column, one for each color in the piece. This may seem like a lot of values to obtain, but surprisingly, the time spent calculating them is unnoticeable. Fewer values are usually computed anyway, since the evaluation function skips primitives that do not appear in the current GP function tree, and further reductions are made when both balls in the current piece are the same color.

Once all the primitives have been calculated for each column, color, and position, the function runs through the 6 * 2 possible locations where each ball in the current piece could land and evaluates the GP tree at each location.

Finally, the evaluation runs through each of the 22 moves and determines which one gives the maximal output. After determining the optimal move, it sends the right sequence of commands (shift left, shift right, rotate, and down) to a queue, which places the piece in the chosen position. If the evaluation finds more than one optimal move, it randomly chooses among them, each such move being given the same probability of selection.
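
A minimal sketch of this selection loop is given below. The Move class, the 22-element move list, and evaluateTree() are placeholders of my own; only the overall flow (score every placement, keep the maxima, break ties uniformly at random) follows the description above.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class MoveChooser {
    private final Random rng = new Random();

    Move chooseMove(List<Move> legalMoves) {          // the 22 possible placements
        double best = Double.NEGATIVE_INFINITY;
        List<Move> bestMoves = new ArrayList<Move>();
        for (Move m : legalMoves) {
            double score = evaluateTree(m);           // value of the evolved GP tree here
            if (score > best) {
                best = score;
                bestMoves.clear();
                bestMoves.add(m);
            } else if (score == best) {
                bestMoves.add(m);                     // tie: remember every maximal move
            }
        }
        return bestMoves.get(rng.nextInt(bestMoves.size()));  // uniform random tie-break
    }

    double evaluateTree(Move m) {                     // placeholder for the GP function
        return 0.0;
    }
}

class Move { /* rotation plus target column; placeholder */ }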

Training and Testing
For the training aspect of this project, first and foremost, it had to be decided whom each player would train against. Would it train against humans, other AI players, or itself? And what would be the criteria for terminating a player's training session?

Playing against humans was immediately ruled out, since there weren't enough humans around with the time to play the thousands of AI hopefuls in the GP population. Playing against other AI players in the population in a tournament model was a possibility, but it posed some problems, such as the termination criterion: what if both players play equally well and neither ever loses? Should there be a time limit on each game? It was finally concluded that the best way to train each player was against itself. It would play solo until it either reached the top of the pit or the pieces ran out, at which point the s-fitness and the p-fitness would be added with the appropriate weights. The resulting sum would then be converted, using an arbitrary constant, into a score in which better play yields a lower value, consistent with the convention of the gpjpp Java package.

The termination criteria for each player can be satisfied in two ways. The first is that the pieces reach the top of the pit. The second is a limit of 144 pieces per game. 144 was chosen because it takes 36 pieces to fill up a pit, and it was arbitrarily decided that if a player could last long enough to fill the pit four times over (4 * 36 = 144), then its survival capability was good enough to go on indefinitely.
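
These two tests are simple enough to state directly; the sketch below is mine, with hypothetical names, and is not taken from the project source.

class Termination {
    static final int MAX_PIECES = 4 * 36;   // 144: four pit-fillings' worth of pieces

    // A training game ends when the pit overflows or the piece limit is reached.
    static boolean gameOver(boolean pitReachedTop, int piecesPlayed) {
        return pitReachedTop || piecesPlayed >= MAX_PIECES;
    }
}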

For the population training phase, each player goes through three games with itself, each with a different set of pieces. The three piece sets were pregenerated using a program that produces 288-digit strings, each digit in the range 1-6, the numbers 1-6 representing the six possible ball colors. Though the game can be played with anywhere from 4 to 6 colors, the players trained under the most challenging setting, since a player that can play under all settings is desired: a player that can play with 6 colors is more likely to be able to handle 4 colors than a player trained with only 4 colors is to handle 6.
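
The pregeneration step can be sketched as follows. The actual generator program is not shown in this write-up, so the code below only follows its description: a 288-digit string with each digit drawn from 1 to 6, i.e. 144 two-ball pieces.

import java.util.Random;

class PieceSetGenerator {
    // One preset piece set: 288 digits = 144 two-ball pieces, colors 1..6.
    static String generatePieceSet(Random rng) {
        StringBuilder sb = new StringBuilder(288);
        for (int i = 0; i < 288; i++) {
            sb.append(1 + rng.nextInt(6));   // one digit per ball
        }
        return sb.toString();
    }
}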

In the player evolution process, fitness was not penalized for tree complexity. Though in most real-time systems tree complexity would be an issue, it was found that the game engine used here is fast enough that tree complexity causes no noticeable difference in move-to-move play.

There was a total of 5 different populations, each run under a different point on the Pareto-optimal front, and each population underwent 21 generations. The best player of each generation was then run through the validation set, consisting of 10 games on preset 288-digit piece sets. After determining the best generation from each population, the 5 AI players from those best generations were taken and run through a final testing set, consisting once again of 10 games on preset 288-digit piece sets. None of the 23 preset piece sets used from training through testing were identical, since each was randomly generated individually. This was to ensure that there would be no bias.

Bias, though, is really neither a concern nor a threat in this project, since all cases were randomly generated and since every game contains an element of randomness. In other words, the same player playing the same set of pieces a second time will not play it exactly as it did the first time: there are many occasions during evaluation when the GP tree yields more than one possible move, at which point the program randomly chooses which move to take.

Both the validation set and the testing set were run using the p05 Pareto setting, where s-fitness is weighted 0.05 and p-fitness 0.95. This was chosen because it was found that most players that had evolved this far could max out their survival fitness anyway. Could choosing the p05 front for validation and testing bias the results? Possibly, but evidently not in this case, since the AI player that was evolved on the p05 front did the worst of all 5 final players.

The GP representations of the 5 remaining AI players were programmed into the PuyoPuyo game configuration and are available as players AI #1 through AI #5, with AI #1 being the easiest and AI #5 the hardest, as determined by the testing run.


Results
Here are the resulting 5 AI players, their functions, their fitness scores from their testing phase, the section of the Pareto optimal front they're from, and the delay in milliseconds between piece moves. (The delay value is arbitrary, not genetically evolved. The smaller the delay, the faster the player.)


AI #1
-----
Function:  ( +  ( /   diagonals2  ( *   sides   column ))
                ( +  ( /   sides  ( *   4 ( *   sides   height )))
                     ( +   sides   diagonals )))

s-fitness: 1382 out of 1440

p-fitness: 1127

pareto:    p05

delay:     400 ms



AI #2
-----
Function:  ( +  ( +  ( +   height   sides )  sides )  adjcolumn )

s-fitness: 1440 out of 1440

p-fitness: 1152

pareto:    p95

delay:     300 ms



AI #3
-----
Function:  ( /  ( +   diagonals2  ( +   sides ( +   sides
                                                ( /   sides   diagonals2 ))))
                diagonals )

s-fitness: 1418 out of 1440

p-fitness: 1158

pareto:    p75

delay:     250 ms



AI #4
-----
Function:  ( +   sides  ( +   height   adjcolumn ))

s-fitness: 1440 out of 1440

p-fitness: 1166

pareto:    p50

delay:     225 ms



AI #5
-----
Function:  ( +  ( +   diagonals2  ( /  ( +  ( +   sides  ( -   4   sides ))
                                            ( *  ( / ( /   sides   sides )
                                                     column )
                                                 sides ))
                                       column ))
                ( *  ( +   sides   sides ) sides ))

s-fitness: 1440 out of 1440

p-fitness: 1226

pareto:    p25

delay:     200 ms


To see all the output, click here.


Conclusion
Looking at the results, it can be said that this project was a success. Five AI players were evolved that could play decently, even defeating experienced human players (i.e. me) more than half of the time. But that's not to say that the results couldn't have been better.

First of all, it was found that the Pareto-optimal front was not necessary in this project. Looking at the finally evolved players, it can be seen that most of them could survive indefinitely (indefinitely for at least 1440 pieces, anyway!). But this is not the main reason the front wasn't needed; in fact, there were two cases that did not max out at 1440, but that's beside the point. The point is that the p-fitness inherently reflects the s-fitness by its value: the larger the s-fitness, the larger the p-fitness will be. Therefore, the s-fitness could have been excluded from the fitness calculations completely and the Pareto front done away with.

But there is a second, better way to calculate fitness that would still require the front. If the p-fitness were not a simple sum of all the rocks sent over, but a ratio of rocks sent over per piece, then the survival fitness would cease to be an inherent part of the p-fitness value. This alternative is worth raising because another problem noticed in the final results may require that a front remain in place should such modifications be made.

The problem is that the final AI players do not take into account, at least not to the degree they should, the height of a column. If you play against these AI players, you will see that they often stack pieces all the way to the top, even with open spaces lower in the pit available. I can only attribute this shortcoming to the fact that the players were trained playing by themselves. They were not trained in a hostile head-to-head environment where the opponent is constantly sending rocks over. These AI players were good at keeping their pieces low throughout solo play, so they never experienced the high-pressure situation where pit space grows scarce and height must be taken into account in order to survive.

As a solution to this problem, a rock-sending mechanism could be added to the training sessions, forcing the players to learn how to avoid high columns. The mechanism could send a couple of rocks down in random locations with every piece, or it could be more responsive to how the AI is playing. The danger with a more dynamic rock-sending mechanism is that it may cause the GP players to evolve certain unwanted behaviors. For example, if it sends rocks down in abundance only when the pieces in the pit are at a low level, then the players may evolve to avoid chain-reaction strategies and seek only to group pieces quickly to get them out of the way.

There are many other improvements, or even totally different approaches, that could lead to better PuyoPuyo players, but for a start the results from this project are good, not only for creating competent players but also for opening up new ideas to incorporate into the following "generations".


References
[1] Horn, J., Nafpliotis, N., and Goldberg, D. E. (1994) A Niched Pareto Genetic Algorithm for Multiobjective Optimization. First IEEE Conference on Evolutionary Computation, IEEE World Congress on Computational Intelligence, Volume 1, 1994, pp. 82-87.

[2] Janikow, C. Z. (1993) A Knowledge-Intensive Genetic Algorithm for Supervised Learning. Machine Learning, 13, pp. 189-228. Department of Mathematics and Computer Science, University of Missouri, St. Louis.