NFL Predictor
Anastasio E. Kesoglidis & Patrick L. Johnson

 

Introduction

Both authors of this program are avid sports fans. When the list of suggested projects included an NFL game predictor, we quickly agreed that this was what we wanted to work on. Having seen the power of learning during the semester in the domain of face recognition, we thought it would be both interesting and fun to develop a system that could learn to predict the outcome of NFL games. Thus our problem was: can we create an accurate NFL game predictor that uses learning? To build such a system, we gathered statistical data from past NFL seasons and used it as our data set. The system learned which statistics were more important than others by comparing teams that met in the playoffs of past seasons and noting which statistics the winning team was better at. The system was tested on the 1998 NFL season, and it was then used to make predictions about the upcoming 1999 season.

 

Approach

Intelligent systems in sports domains usually rely on statistical data from previous seasons. It was clear from the start that if we wanted to accurately pick a winner between two teams, we would have to look at two things: the players on each team and their respective statistics in previous seasons. We decided not to look too far back into the past, as a player's current ability most closely resembles their performance over the most recent two or three seasons. Thus, we focused on statistics from the three most recent seasons: 1996, 1997, and 1998.

After going through a pool of different statistical categories related to each player's position, we picked out what we considered the most important aspects of each position. In the end, we came up with 27 statistical categories spanning 8 different positions. For example, we decided that the following 7 statistics were the most important for quarterbacks: number of games started, number of passing attempts, percentage of passes completed, average number of yards per attempt, number of touchdowns thrown, number of interceptions thrown, and passer rating.
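As a concrete illustration, these quarterback categories could be stored as a simple list of labels. The sketch below is our own; the names are not the labels used in the actual data files.

    // Hypothetical labels for the seven quarterback categories; the real
    // data files may name them differently.
    public static final String[] QB_CATEGORIES = {
        "gamesStarted",     // number of games started
        "passAttempts",     // number of passing attempts
        "completionPct",    // percentage of passes completed
        "yardsPerAttempt",  // average yards per attempt
        "touchdowns",       // touchdowns thrown
        "interceptions",    // interceptions thrown
        "passerRating"      // quarterback rating
    };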

Because the statistical data was split into different files according to position, we had to create a module that would go through each of these files and build a roster for each team. The module read every statistical file and collected all the players who played for a particular team, bringing them together into a TEAM structure. This gathering of players was done separately for each of the three seasons we used, because some players change teams between seasons. While gathering the players for a particular team and season, certain statistics for certain players were combined and others were eliminated, in order to create team statistics for each position. For example, combining the total number of sacks that each player had on a team told us how good a team was defensively as a unit, not as individuals. This was not appropriate for every position, though; some positions are more individualistic than others. A really good quarterback means a great deal to a team, whereas, because the defensive line is made up of several players, one good lineman can be offset by one bad one.
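A minimal sketch of what the TEAM structure might look like, assuming a simple map from category names to combined values (all names below are our own illustration, not the actual code):

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of the TEAM structure: one instance per team per season.
    class Team {
        String name;
        int season;  // 1996, 1997, or 1998
        Map<String, Double> stats = new HashMap<>();  // category -> team value

        // "Unit" positions (e.g., sacks across the defensive line) are
        // summed over every player gathered onto the roster.
        void addUnitStat(String category, double value) {
            stats.merge(category, value, Double::sum);
        }

        // Individualistic positions (e.g., the starting quarterback) keep
        // a single player's numbers.
        void setIndividualStat(String category, double value) {
            stats.put(category, value);
        }
    }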

For the next phase, we had to decide how to use this gathered data to compare two teams. We concluded that we had to determine which of the 27 gathered statistics were more important than others. For example, is the number of field goals made as important as the number of touchdowns a running back has? To address this problem, we took the gathered statistics for each team in a particular year and ranked the teams in each statistical category. For example, the team with the most quarterback touchdowns was given a ranking of 1 in that category, the next team a 2, and so on. Teams that had the same value for a statistic were given equal rankings, and teams for which no data was available for a statistic were given a ranking of 0 (denoting not a bad ranking, but no information).
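A sketch of this ranking step for a single category, assuming larger values are better (for a category like interceptions thrown, where smaller is better, the comparison would simply be reversed); the method and variable names are ours:

    import java.util.*;

    // Rank all teams in one statistical category: rank 1 is best, tied
    // values share a rank, and teams with no data get rank 0.
    static Map<String, Integer> rankCategory(Map<String, Double> valueByTeam) {
        Map<String, Integer> rankByTeam = new HashMap<>();
        List<Map.Entry<String, Double>> known = new ArrayList<>();
        for (Map.Entry<String, Double> e : valueByTeam.entrySet()) {
            if (e.getValue() == null) rankByTeam.put(e.getKey(), 0);  // no information
            else known.add(e);
        }
        known.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        int rank = 0;
        Double previous = null;
        for (int i = 0; i < known.size(); i++) {
            Double value = known.get(i).getValue();
            if (!value.equals(previous)) rank = i + 1;  // equal values keep the same rank
            rankByTeam.put(known.get(i).getKey(), rank);
            previous = value;
        }
        return rankByTeam;
    }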

Thus we had a ranking for each team, for each statistic, for three seasons. Breaking the data down into rankings made it more comparable: teams with relatively close numbers ended up with close rankings, and those with very different numbers ended up far apart in rank. From this data we decided that we were ready to "learn."

The approach we chose for learning was to first assign every statistic in the compilation a starting weight of 1. Our ultimate goal was that, by the time learning finished, the weights would be updated to values that captured what was really important about winning games and devalued meaningless stats. To that end, we looked at playoff games and compared the winning team to the losing team. In statistical areas where the winning team was stronger, we increased the magnitude of that weight; where the team won despite being weaker, we decreased it. The hope was that over the course of many games, the effect of "flukes" would dissipate and the important factors would show themselves. The Super Bowl was weighted more heavily than the other playoff games because we felt that the team that won the Super Bowl really had something the other team was missing that pushed it over the top. After all the weights had been adjusted according to how much each statistic helped teams win, we computed power ratings for the teams. A power rating was a single number that captured a team's strength: it was calculated by taking the numbers the team achieved in the 27 statistical categories, multiplying each by the computed weight for that category, and summing. This computation was done after all the learning. Comparisons between teams were made based on their power ratings. If two teams had very close power ratings, the program chose the home team and predicted the game would be very close. Otherwise, if the difference was within a second, larger threshold, it output the winner and predicted a regular game. For power rating differences above that threshold, it predicted a blowout for the winning team.
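A minimal sketch of the weight-update rule and the power-rating computation, assuming strength in a category is compared via the ranks from the previous step (1 is best, 0 means no data); the step size and Super Bowl multiplier below are illustrative placeholders, not the values we actually used:

    static final int NUM_STATS = 27;
    static final double STEP = 0.1;               // placeholder learning step
    static final double SUPER_BOWL_FACTOR = 2.0;  // the Super Bowl counts extra

    // Adjust the weights after one playoff game: strengthen categories in
    // which the winner outranked the loser, weaken those where it won anyway.
    static void updateWeights(double[] weights, int[] winnerRanks,
                              int[] loserRanks, boolean superBowl) {
        double step = superBowl ? STEP * SUPER_BOWL_FACTOR : STEP;
        for (int i = 0; i < NUM_STATS; i++) {
            if (winnerRanks[i] == 0 || loserRanks[i] == 0) continue;  // no data
            if (winnerRanks[i] < loserRanks[i]) weights[i] += step;   // winner was stronger
            else if (winnerRanks[i] > loserRanks[i]) weights[i] -= step;
        }
    }

    // After learning, a team's power rating is the weighted sum of the
    // numbers it achieved in the 27 categories.
    static double powerRating(double[] teamNumbers, double[] weights) {
        double rating = 0.0;
        for (int i = 0; i < NUM_STATS; i++) rating += teamNumbers[i] * weights[i];
        return rating;
    }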

We created a Java applet for two reasons: simplicity in simulating a season, and for demoing our project to the class. The applet allows you to select a home team, a road team, and the year in which the game is to be played. To make the applet run faster, we simply hardcoded the learned power ratings for each team into the applet, so as not to repeat the learning every time a game is entered. The applet can be accessed online at http://www.columbia.edu/~aek19/nfl.html.
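Inside the applet, the comparison then reduces to a few lines. The sketch below uses placeholder threshold values standing in for the ones we tuned:

    static final double CLOSE_THRESHOLD = 5.0;     // placeholder value
    static final double BLOWOUT_THRESHOLD = 20.0;  // placeholder value

    // Predict a game from the hardcoded power ratings of the two teams.
    static String predict(String home, double homeRating,
                          String road, double roadRating) {
        double diff = Math.abs(homeRating - roadRating);
        if (diff < CLOSE_THRESHOLD)
            return home + " wins a very close game (home field edge)";
        String winner = homeRating > roadRating ? home : road;
        if (diff < BLOWOUT_THRESHOLD)
            return winner + " wins a regular game";
        return winner + " wins in a blowout";
    }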

 

Results of running the program on the 1998 NFL season

To test what our program had learned, we took data from 1996 and 1997 and used it as training data to teach the program which statistics matter most in determining what makes one team better than another. It turned out that the most important statistics were quarterback numbers, an understandable result that can be confirmed by casual football knowledge. Armed with this and the other learned weights, we obtained a copy of the 1998 schedule along with the results and ran the program on each game of the season. We recorded what our program predicted the outcome of each game to be and compared that with the actual result from last season. Out of the 240 games, it predicted the correct winner 148 times, or 62% of the time. Details of the program's performance on the 1998 season appear in Appendix A. While we were encouraged that the program had at least learned something, we observed several problems that we felt contributed to the accuracy being lower than we had hoped.
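The evaluation itself is a straightforward loop over the schedule. In this sketch the Game type and its fields are hypothetical stand-ins for our actual bookkeeping:

    import java.util.List;
    import java.util.Map;

    // One game of the 1998 schedule together with its actual result.
    record Game(String home, String road, String actualWinner) {}

    // Fraction of games whose winner the program picked correctly;
    // for 1998 this came out to 148/240, about 62%.
    static double accuracy(List<Game> schedule, Map<String, Double> ratings) {
        int correct = 0;
        for (Game g : schedule) {
            double home = ratings.get(g.home());
            double road = ratings.get(g.road());
            String predicted = home >= road ? g.home() : g.road();  // home edge on ties
            if (predicted.equals(g.actualWinner())) correct++;
        }
        return (double) correct / schedule.size();
    }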

One problem we had was with the data set we used to compile team statistics. We didn't realize at the time we chose this data set, or during the early stages of using it, that it credited all of a player's statistics for the whole season to the team he was currently on, even if he had only recently been acquired by that team and had in fact accrued those numbers largely while playing for another team. We tried to mitigate this by using the maximum value a team had for some of the statistics; other times only a total really made sense, so we were forced to include some of the incorrect data. However, this was not a common case, so we feel that on the whole the effect was rather minor.

As stated in the approach discussion, our program attempted to ascertain what made the difference between winning teams and losing teams. We had a wealth of statistical information available to us, but our program had to make sense of it and weed out the largely insignificant values from the ones that really separated the winners from the also-rans. One problem that contributed to lowering our results was that we could not learn from data from the current season while we were running the program, which left the program naive about the season in progress. To evaluate the teams in the current season, all we had to go on was previous performance. Over the course of a season, there are often developments that would change how you evaluate a team. A good example from the season we ran was the Minnesota Vikings' performance. Before the season, the team had a lot of question marks. Their quarterback was someone who had been out of the league for more than a season. They had just signed a tremendously talented receiver who had been having trouble with the law. Their defense was unproven. Without any prior knowledge of the season, many experts, not just our program, were doubtful the team would do very well. However, a few weeks into the season, experts were
starting to change their opinions. Everything went right for this team. The quarterback came out
of retirement and played better than he ever had before. The star receiver stayed straight and
became one of the best in the league. The defense solidified and quickly cleared the doubts people
had. As it became apparent that this was one of the best teams in the league, prognosticators were able to update their opinions and replace the uncertainty with strong confidence in the team. Later in the season the Vikings were almost always the favorite and often outscored their opponents by impressive margins. The fact that our program couldn't take advantage of this dynamism probably cost it some games. However, we couldn't get midseason data about the players, so we couldn't make the same adjustments that football experts and casual observers could. We attribute this to the unpredictability of sports: the team in question, though talented, could very easily have ended up mediocre, so it wasn't necessarily a bad decision to evaluate them the way the program did; it just didn't work out in this case.

Another problem was that we sometimes overestimated talented teams that piled up good numbers but, for one reason or another, didn't put together many wins. This phenomenon has mystified coaches and fans for some time: some teams seem to have all the talent they could hope for but are unable to "put it all together." Overall, we felt the results showed that the program had learned respectably well what makes teams win and lose.

 

Results of running the program on the 1999 NFL season

Our decision to use the data from 1996, 1997, and 1998 to simulate next year's season grew mostly from curiosity. Having reached 62% accuracy, we decided to see what would happen if we added a third data set. Although we won't be able to see how accurate the results of this simulation are until next year, we thought it was worth doing. The actual schedules for next season were used, with the home teams playing at home and the road teams playing on the road. The only slight bump prior to running the simulation was that a new team on which we had no past data, the Cleveland Browns, was part of the season. The simplest way of handling this was simply not to count games that involved them; in terms of the simulation, the only consequence was that some teams ended up with fewer games played than others.
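In code, skipping the Browns amounts to one check before simulating each game; the names below reuse the hypothetical pieces sketched earlier:

    import java.util.List;
    import java.util.Map;

    // Simulate the 1999 season, skipping any game that involves a team we
    // have no historical data for (the expansion Cleveland Browns).
    static void simulateSeason(List<Game> schedule, Map<String, Double> ratings) {
        for (Game g : schedule) {
            if (!ratings.containsKey(g.home()) || !ratings.containsKey(g.road()))
                continue;  // no data for this matchup: do not count the game
            // ... predict and record the result exactly as in the 1998 run ...
        }
    }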

The results of the simulation can be viewed in Appendix B. Judging simply from the standings on the first page, we were not too surprised by the results. All but one of the division winners (those marked with a y) were playoff teams last season. There were two surprises: 2 teams that did much better than expected (St. Louis and San Diego) and 4 teams that never won a game.

The reason the two teams did so well is that they had good rankings in the categories our learning process labeled as most important. We strongly doubt they will perform next season as they did here, however. We were amazed that 4 teams did not win any games; we would have accepted 2 winless teams, but 4 seems too many. The reason they fared as they did was their schedule: all of the teams they ended up playing were much better than they were. Even home field advantage was not enough to get them a win in some games.

In our algorithm, we had a bias toward the home team: if two teams of relatively equal strength played each other, we picked the home team as the winner, because we felt that home field usually plays a big role in close games. Although we could have randomized this a bit, after seeing the good teams do well it did not seem to make much of a difference (if a good team did well, it won games away from its stadium). The only place where this seems to have caused a problem is in the simulation of the playoffs. Because teams with better records are given home field advantage in the playoffs, and because most playoff teams are of similar strength, the teams that played at home won every game in the playoffs. In our simulated Super Bowl XXXIV, the Broncos beat Atlanta simply because we made them the home team; had Atlanta been the home team, Atlanta would have won. Thus, between the standings and the playoffs, we think the standings will come closer to next season's reality than the playoff results.

 

Conclusions

Our program was an application of some of what we learned in the course to an interesting problem in a fun domain. The project was interesting because we had prior interest in both artificial intelligence and sports, and because the nature of the problem allowed for immediate gratification in testing and evaluation. We encountered some of the challenges that make artificial intelligence difficult: the real world (or even a microcosm of it in a game) is an incredibly complex space that we attempt to abstract away from and infer knowledge about. Computers and AI learning techniques let us make some progress, but the basic problem of modeling the real world remains a challenge. We are satisfied with what we did.

 

APPENDIX A

APPENDIX B