Thursday, March 15, 2018

Predicting March Madness with Elo ratings and Monte Carlo simulations

I was recently invited to join a March Madness bracket pool at my office. I quickly realized that this was a great opportunity to apply some of the concepts that I learned about in reading Kasparov's recent book Deep Thinking, discussed in my previous post. My original idea was to apply Elo ratings, originally developed for chess, to inform my bracket. The scope of this effort quickly grew as I realized that I could wrap additional layers of abstraction around an Elo calculator, incorporating Monte Carlo simulations and sensitivity analysis, to make a fully-automated bracket generator. In this post, I'll discuss my approach, method, and results.

About Elo ratings

Elo ratings were originally created for the purpose of quantifying a player's skill level in chess over time relative to their peers. The Elo rating system has many nice properties, such as the ability to predict game outcomes based on prior skill levels, and its drop-in/drop-out asynchronous nature. For its simplicity and power, the rating scale has found use in many sports beyond chess. You can read about the math behind it on Wikipedia.

Graph showing relationship between Elo difference and win probability; source

Elo vs. NCAA RPI

In ranking its teams, the NCAA uses a system called RPI, or Rating Percentage Index, which is oddly arbitrary. Historically, it seems to have good predictive power, but this may be somewhat deceiving: receiving a high seed appears to be inherently advantageous regardless of team skill due to the bracket structure, as Jon Bois humorously shows:

The primary free parameter in the Elo rating system is its so-called K factor, which controls the speed at which ratings respond to apparent increases in player skill. In comparison, the RPI system has three weighting parameters, which are more or less arbitrary, and whose meaning is less obvious in a higher-level sense. As Wikipedia puts it, RPI "lacks theoretical justification from a statistical standpoint" and the "heavy emphasis upon strength of schedule [may give] an unfair advantage to teams from major conferences". My analysis supports this criticism of the RPI system - more on this below. More critically for this effort, it also has limited predictive power, so in my model, I use Elo exclusively and neglect NCAA ranking except as it influences the initial bracket structure.

Generating Elo ratings

Elo ratings are generated by examining win/loss records. I decided to limit my scope to only the 2017-2018 season, neglecting any prior data, to maintain simplicity and avoid the additional modeling assumptions that would be required.

I scraped a complete game log for all of the active teams and parsed them to remove duplicates and non-D1 games. In total, the scraper extracted 5,401 unique games between 351 D1 teams spanning 2017-11-10 to 2018-03-10. I also manually added the play-in "First Four" games, which are effectively regular-season games. For those interested, this dataset is available on my GitHub.

With a log of all unique games, calculating Elo ratings is straightforward. Teams are initialized with an average score, and their score is adjusted as they win and lose games according to Elo's equations. In essence, the equations predict the probability of each team winning based on their rating difference, then compare the actual result to the prediction, then update the teams' ratings accordingly. Its simple and self-correcting nature reminds me of Bayes' Theorem, another useful statistical tool for making predictions based on observations.
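In code, this update loop looks something like the following. This is a simplified sketch rather than my actual implementation: the (winner, loser) log format and the K = 25 default are illustrative assumptions.

```python
from collections import defaultdict

def elo_ratings(games, k=25, mean=1500):
    """Compute Elo ratings from a chronological list of (winner, loser) pairs.

    Every team starts at `mean`; after each game, the winner gains and the
    loser drops by k * (1 - expected winner score), so ratings stay zero-sum.
    """
    ratings = defaultdict(lambda: mean)
    for winner, loser in games:
        rw, rl = ratings[winner], ratings[loser]
        # Elo's expected score for the eventual winner
        expected = 1 / (1 + 10 ** ((rl - rw) / 400))
        ratings[winner] = rw + k * (1 - expected)
        ratings[loser] = rl - k * (1 - expected)
    return dict(ratings)
```

Feeding the 5,401-game log through a loop like this, in date order, yields the season-long rating trajectories plotted below.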

Elo over 2017-2018 season plus First Four, selected teams highlighted

Comparing Elo and NCAA rankings

Elo provides a check against the NCAA ranking system. Which teams are most overrated? Most underrated? Here is what the data says about the NCAA's top 25 teams:

Green: underrated; red: overrated, per Elo assessment
For Elo, K = 25 and mean = 1500; NCAA source: Google

Interestingly, the comparison produces a jarring collection of strong agreements and strong disagreements. Most overrated is Florida, ranked at #23 with a 20-12 record. Most underrated are Houston and Saint Mary's (CA), ranked #21 and #25, with records of 26-7 and 28-5, respectively. The systems agree that Virginia and Villanova are the strongest teams. Here are my top 25 teams sorted by regular-season Elo:

Notably, Saint Mary's (CA), ranked #16 in Elo, did not qualify for the Round of 64 - a victim of the NCAA's archaic ranking system.

Determining postseason probabilities

Elo's equations combined with teams' ratings can be used to calculate the probability of any team beating any other team. 
For example, in the first game, Virginia, rated 1761, plays UMBC, rated 1604. Per Elo, Virginia's win probability is:

[1+10^((1604-1761)/400)]^-1 = 71.2%

So Virginia is most likely to win, but an upset can't be ruled out either. At this point, a call to a random number generator bounded on [0, 1] can be used to simulate the game: if the result is < 0.712, Virginia wins; otherwise, UMBC wins. How to track both outcomes? For this, I turned to Monte Carlo simulation, which allows us to re-run this game, and all downstream games, an arbitrarily large number of times until a coherent picture of overall probability emerges. The result is the probability of any team reaching any round, encompassing both team skill and bracket structure. FiveThirtyEight has done a typically excellent job of visualizing this. My own visualization of the same data is shown below, though our models and probabilities differ.
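A single simulated tournament can be sketched as below. This is a simplified illustration using "static" Elo (ratings are not updated between simulated games); my actual program differs in structure, and the bracket/ratings inputs here are hypothetical.

```python
import random

def win_probability(r_a, r_b):
    """Elo probability that the first team beats the second."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def simulate_bracket(teams, ratings, rng=random.random):
    """Play out one single-elimination bracket and return the champion.

    `teams` lists the field in bracket order; adjacent pairs meet first.
    """
    while len(teams) > 1:
        teams = [a if rng() < win_probability(ratings[a], ratings[b]) else b
                 for a, b in zip(teams[::2], teams[1::2])]
    return teams[0]

def championship_odds(teams, ratings, n=100_000):
    """Monte Carlo estimate of each team's chance of winning it all."""
    counts = {t: 0 for t in teams}
    for _ in range(n):
        counts[simulate_bracket(teams, ratings)] += 1
    return {t: c / n for t, c in counts.items()}
```

For the Virginia-UMBC example above, win_probability(1761, 1604) reproduces the 71.2% figure; running championship_odds over the full 64-team field produces round-by-round probabilities like those visualized below.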

As a clerical note, I experimented with two approaches for dealing with Elo ratings after simulated games: "static" Elo, where the ratings are unchanged after games, and "dynamic" Elo, where the ratings are updated based on the simulated outcome. The difference between the two ended up being fairly negligible - more on this later.

Incorporation of Monte Carlo simulations

Each game in March Madness can have two possible outcomes, and there are 32 + 16 + 8 + 4 + 2 + 1 = 63 games in total. So the total number of possible brackets is 2^63, or 9.22 x 10^18. This enormous state space makes exhaustive analysis impossible, and explains why Warren Buffett has not yet had to pay out his jackpot prize for a perfect bracket, and likely never will. However, exhaustive analysis, even if it were possible, is not necessary - Monte Carlo simulation provides a sufficiently accurate approximation.
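The counting argument is quick to verify:

```python
# One game eliminates one of the 64 teams, so 63 games decide the tournament.
games = 32 + 16 + 8 + 4 + 2 + 1
brackets = 2 ** games  # each game has 2 outcomes
print(games, brackets)  # 63 games, 9,223,372,036,854,775,808 brackets
```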

My program runs about 5,000 simulated tournaments per second on a Dell Precision 5520. It monitors the maximum probability of any team becoming champion as a figure of merit, and considers the probabilities converged when this value changes by less than 0.025% between 5-second intervals. Convergence takes around 100,000 tournaments, or 20 seconds.
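The stopping rule can be sketched as follows. For clarity, this version checks convergence between fixed-size batches rather than 5-second wall-clock intervals, and the batch size and cap are illustrative choices, not my program's exact values.

```python
def run_until_converged(simulate_once, teams, batch=25_000,
                        tol=0.00025, max_batches=40):
    """Accumulate tournament simulations in batches until the figure of merit
    (the maximum championship probability of any team) changes by less than
    `tol` between consecutive checks.
    """
    counts = {t: 0 for t in teams}
    total = 0
    prev_max = None
    for _ in range(max_batches):
        for _ in range(batch):
            counts[simulate_once()] += 1
        total += batch
        cur_max = max(counts.values()) / total
        if prev_max is not None and abs(cur_max - prev_max) < tol:
            break  # converged
        prev_max = cur_max
    return {t: c / total for t, c in counts.items()}
```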

Visualizing probabilities

Elo K: 25, Elo mean: 1500, dynamic Elo in postseason

Colors: green denotes above-average probability for that round; red is below-average 

The probabilities are then sorted according to each team's probability of winning the championship - more on why below. The low probability of even the most likely outcomes - e.g., the strongest teams have only a 5-10% chance of winning the championship - makes clear how difficult the outcome is to predict. The ultimate odds are relatively slim for all but the top teams, and absolutely slim for all teams.

Down-selecting to a single bracket

In moving from probabilities to a final bracket, I considered how the bracket will be scored and, more generally, what the ultimate goal is. In my case, the goal is to maximize my bracket's score under ESPN's rules so I can win the office pool. Those rules award:
  • Round of 32: 10 points per pick
  • Round of 16: 20 points per pick
  • Round of 8: 40 points per pick
  • Round of 4: 80 points per pick
  • Round of 2: 160 points per pick
  • Round of 1: 320 points per pick
This scoring system results in the optimal strategy being a bit counter-intuitive. Suppose every game is decided at random with a coin toss. The probability of correctly selecting the champion, worth 320 points, is 1/64, or 1.6%. One round upstream, the probability of correctly selecting both teams in the Round of 2, also worth 320 points, is (1/32)^2, or 0.1%. The percentages get progressively worse further upstream. In short, the expected point yield gets worse the further upstream you target, so the optimal strategy is to target as far downstream as possible - in effect, to build your bracket in reverse chronological order.
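The coin-toss numbers above generalize: each round is worth 320 points in total, and the chance of sweeping a round's full value shrinks rapidly the further upstream you go. A quick arithmetic check:

```python
# Under a coin-flip model, compute the probability of sweeping each round's
# full 320 points. r = wins each picked team needs (6 = champion pick).
for r in range(6, 0, -1):
    picks = 2 ** (6 - r)           # picks in that round (1 pick = champion)
    p_sweep = (0.5 ** r) ** picks  # all of the round's picks correct at once
    print(f"{picks:2d} pick(s), {r} win(s) each: P(sweep) = {p_sweep:.6%}")
```

The champion pick sweeps with probability 1/64 (1.6%), the two finalist picks with (1/32)^2 (0.1%), and so on, matching the figures above.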

For my bracket, I use overall probability of winning the championship as the metric to determine which teams advance. If the rounds were equally weighted, or equivalently, if I was competing for overall percentage, I would simply advance the team with the greater Elo rating in each game, and Monte Carlo simulation would not be necessary.

In summary, my bracket does not represent my best guess at the tournament's outcome, but rather a series of choices made to maximize score - an important distinction. 

Sensitivity analysis

To probe the sensitivity of my results to my modeling assumptions, I wrapped an additional layer of abstraction around the bracket generator and modified three of the key model parameters, borrowing historically stable values from chess to guide parameter ranges:
  1. Elo mean - 1000 and 1500, the two standard values I've seen in literature
  2. Elo K value - 10, 25, and 40, per FIDE's standards
  3. Elo rating in playoffs - either "dynamic" or "static"
Since it wasn't obvious which of these models was best, I decided to analyze all permutations, 12 in total, and weigh them equally. In other words, each bracket gets a vote, and the majority vote wins.
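The permutation-and-vote step can be sketched like this. The generate_bracket(mean, k, mode) callable is a hypothetical stand-in for my bracket generator, assumed to return one picked winner per bracket slot.

```python
from collections import Counter
from itertools import product

MEANS = (1000, 1500)
KS = (10, 25, 40)
MODES = ("static", "dynamic")

def consensus_bracket(generate_bracket):
    """Run the generator under all 12 parameter permutations and take a
    slot-by-slot majority vote across the equally weighted brackets.
    """
    brackets = [generate_bracket(mean, k, mode)
                for mean, k, mode in product(MEANS, KS, MODES)]
    # zip(*brackets) groups the 12 picks made for each bracket slot
    return [Counter(slot).most_common(1)[0][0] for slot in zip(*brackets)]
```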

In general, the models are in unanimous agreement about the best choice for most bracket positions, including unanimity from the Round of 4 onward, and only a single dissenting vote in the Round of 8. Further upstream, minor disagreements progressively creep in. After extracting a consensus from the brackets, I surveyed which models, if any, produced a bracket that exactly matched the consensus bracket - in other words, which model was most "centrist". I found that my initial set of model parameters - Elo mean: 1500, Elo K: 25, dynamic postseason Elo - provided an exact match. So the sensitivity analysis confirms that these settings are reasonably stable.

My bracket

Bringing it all together, here is my bracket:

Source Code

Full source code and data logs, minus the game scraper, are available on my GitHub.


Update: April 4, 2018

March Madness is now complete. How did my bracket do?

Better than average globally, but not as well as I hoped. It scored 680 points out of a possible 1920, ranked #9 of 16 in the office pool, and was 63rd percentile per ESPN.

distribution of ESPN bracket scores from various pools

What does this mean for my bracket program? In short, it's hard to say. In the presence of March Madness' trademark unpredictability, a poor result does not necessarily indicate a poor decision-making process, nor does a good result indicate a good decision-making process.

This year, UMBC upset Virginia in the first round - the first time in tournament history that a #16 seed defeated a #1 seed. My program gave Virginia a 71% chance to win; FiveThirtyEight had them at 98%; Ken Pomeroy at 97%. Had Virginia instead won this game and gone on to win the championship as I predicted, my bracket would have scored 1310 points, or 99th percentile. This is wishful thinking, of course, but it illustrates how sensitive the results are to flukes, upsets, injuries, or a team simply being "off" or "on".

More succinctly, The Harvard Sports Analysis Collective wrote in 2011:

"More generally, almost all prediction methods make the dubious assumption that NCAA Tournament games are the same as regular season games. That does not seem to hold true. NCAA Tournament games are played in bigger arenas, under brighter media spotlights, and with higher stakes than almost any regular season game."

When it comes to bracketology, I've learned that there seems to be no end to the number of metrics and methods that armchair statisticians like myself have devised in an attempt to crack this inscrutable problem. The next steps I see for this are refining the prediction method using historical regular season and tournament outcomes, and also evaluating other sports, particularly baseball, which has a longer regular- and post-season, and so is a better candidate for statistical analysis.
