Playing With Win Probability Models

I recently developed a win probability model for the awesome py_ball package in Python. The package itself makes NBA/WNBA data accessible to a wide audience. If you haven’t seen it, you should definitely check it out. The link is https://github.com/basketballrelativity/py_ball.

In this blog post, I’ll describe the methods I used to develop the model.

Methods

Our model relies heavily on a series of logistic regressions, which depend on (a) the amount of time remaining in the game, (b) the point differential, and (c) who has possession. Right now, the only bias we introduce at the beginning of the game is home-court advantage, which is why the home team always has slightly better odds than the away team. This happens because we feed everything into the model with respect to the home team, so the model learns that the home team has a slight advantage. We are hoping to add betting odds to produce true pre-game win probabilities.

To develop the model, we follow a method Brian Burke used in his win probability models: splitting the game into multiple time groups.

We split the game into 960 groups (one group for every 3 seconds of a 48-minute game) and fit a separate logistic regression for each group. Each logistic regression takes the point differential and who has possession as inputs. We do not need to input the time explicitly, because each model is trained only on its own time window.
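To make the per-window idea concrete, here is a minimal sketch in plain Python. It is not the py_ball implementation: the function names, the tuple layout of a play, the way the clock is clamped at the window boundaries, and the hand-rolled gradient descent (standing in for whatever fitting routine the real model uses) are all assumptions made for illustration.

```python
import math
from collections import defaultdict

BIN_SECONDS = 3          # one model per 3-second window (from the post)
GAME_SECONDS = 48 * 60   # 2880 seconds of regulation -> 2880 / 3 = 960 windows


def time_bin(seconds_remaining):
    """Map seconds remaining to a window index in 0..959 (boundary handling is assumed)."""
    clamped = min(max(seconds_remaining, 0), GAME_SECONDS - 1)
    return int(clamped // BIN_SECONDS)


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(rows, lr=0.01, epochs=500):
    """Fit (intercept, point_diff, possession) weights by batch gradient descent.

    rows: list of (point_diff, home_has_ball, home_won) tuples, all from the
    home team's point of view.
    """
    w = [0.0, 0.0, 0.0]
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for diff, poss, won in rows:
            x = (1.0, diff, poss)
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - won
            for j in range(3):
                grad[j] += err * x[j]
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w


def fit_all_bins(plays):
    """Group plays by time window and fit one regression per window.

    plays: iterable of (seconds_remaining, point_diff, home_has_ball, home_won).
    """
    by_bin = defaultdict(list)
    for secs, diff, poss, won in plays:
        by_bin[time_bin(secs)].append((diff, poss, won))
    return {b: fit_logistic(rows) for b, rows in by_bin.items()}


def predict(models, seconds_remaining, point_diff, home_has_ball):
    """Home win probability from the model that owns this time window."""
    w = models[time_bin(seconds_remaining)]
    return sigmoid(w[0] + w[1] * point_diff + w[2] * home_has_ball)


# Made-up toy data: four plays that all fall in the same 3-second window.
toy_plays = [(10, 5, 1, 1), (10, -5, 0, 0), (11, 3, 1, 1), (11, -3, 0, 0)]
models = fit_all_bins(toy_plays)
print(predict(models, 10, 5, 1))   # home team up 5 with the ball
print(predict(models, 10, -5, 0))  # home team down 5 without the ball
```

Notice that time never appears as a feature. It only decides which set of weights gets used, which is exactly why each regression can stay so small.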

Figure: Graphical representation of logistic regression

For games that go into overtime, we treat the 5 minutes of each overtime period as if they were the last 5 minutes of the fourth quarter. This ensures that there are enough training samples for the model to actually learn something. For instance, very few games go to 4OT, so a logistic regression trained only on that data would not be able to pick up any trends.
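The overtime rule boils down to a small clock conversion. The sketch below assumes NBA period lengths and a period numbering where 1 to 4 are quarters and 5 and up are overtime periods; the function name and signature are hypothetical, not taken from py_ball.

```python
QUARTER_SECONDS = 12 * 60   # NBA quarter length (assumed)
OVERTIME_SECONDS = 5 * 60   # overtime period length


def regulation_seconds_remaining(period, seconds_left_in_period):
    """Convert (period, period clock) to seconds remaining on a regulation clock.

    Periods 1-4 are quarters. Any period >= 5 is overtime, and its clock is
    treated as if it were the same time remaining in the fourth quarter.
    """
    if period <= 4:
        return (4 - period) * QUARTER_SECONDS + seconds_left_in_period
    return min(seconds_left_in_period, OVERTIME_SECONDS)


print(regulation_seconds_remaining(1, 720))  # tip-off
print(regulation_seconds_remaining(4, 300))  # 5:00 left in the fourth quarter
print(regulation_seconds_remaining(8, 300))  # start of 4OT, same value as above
```

Because every overtime period lands on the same regulation-equivalent clock, a rare 4OT possession gets scored by models that have seen every late fourth quarter in the training set.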

The model is trained on five seasons' worth of games, from 2013-14 through 2017-18.

Results

We evaluated our model on the 2018-19 data using the Brier score.

The Brier score is the mean squared error between the predicted win probability and the actual outcome (1 for a win, 0 for a loss), averaged over every time frame. For instance, if the model predicts a 0.58 probability of winning at a given time and that team goes on to win, that time frame contributes (1 - 0.58)^2 to the score. We sum these values over the entire game and divide by the total number of events, with one event every 3 seconds.
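Here is the calculation as a small sketch. The function name and the sample numbers are made up for illustration; only the 0.58 prediction comes from the example above.

```python
def brier_score(predictions, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(predictions) != len(outcomes):
        raise ValueError("predictions and outcomes must have the same length")
    if not predictions:
        raise ValueError("need at least one prediction")
    total = sum((p - o) ** 2 for p, o in zip(predictions, outcomes))
    return total / len(predictions)


# Three time frames from a game the team went on to win (outcome = 1 each time).
game_predictions = [0.58, 0.70, 0.90]
game_outcomes = [1, 1, 1]
print(brier_score(game_predictions, game_outcomes))

# A model that always says 50% scores 0.25, whatever the outcomes are.
print(brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))
```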

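We received a Brier score of 0.167 for our model. Lower is better: a perfect model scores 0, and a model that always predicts 50% scores 0.25. A score of 0.167 is therefore a fairly decent value, because it means that, on average, our predicted probabilities land closer to the actual outcome than a coin flip would.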