# Finding Determinants of NBA Shot Probability using Interpretable Machine Learning Methods

This is a project that I am presenting as a poster at the CMU Sports Analytics Conference. A full version of this research (and associated code) is here: https://github.com/avyayv/CMSACRepo.

You may also view the poster I created at: http://www.stat.cmu.edu/cmsac/poster2020/posters/Varadarajan-NBAShotProb.pdf

## Overview

Since the advent of basketball analytics, a metric that can accurately determine the relative worth of a player’s defense has been widely sought after. It is widely accepted that features like shot defense are key to a player’s defensive identity, but regularized on-off metrics like RAPM are unable to take this into account. Using player-tracking data, we are able to extract information about shot defense.

We determine the relative importance of a set of offensive and defensive factors on individual shots near the 3 point line. Using 2015-16 SportVU data, where player and ball positional coordinates are captured 25 times a second, and the accompanying play-by-play data, we extract the following features: ‘Distance Between Shooter And Defender’, ‘Shot Distance’, ‘Difference Between Shooter And Defender Height’, and ‘3PT%’. 3PT% is calculated for the entirety of the 2015-16 season.
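As a rough illustration of the ‘Distance Between Shooter And Defender’ feature, the sketch below computes the distance from the shooter to the nearest defender in a single tracking frame. It assumes that the relevant defender is simply the closest one and that positions are `(x, y)` tuples in feet; the function name and the sample coordinates are made up for illustration and are not taken from the project code.

```python
import math

def closest_defender_distance(shooter_xy, defender_xys):
    """Return the distance from the shooter to the nearest defender.

    Assumption: the "defender" for a shot is whichever opponent is closest
    to the shooter in the frame where the shot is released.
    """
    return min(math.dist(shooter_xy, d) for d in defender_xys)

# Hypothetical frame: one shooter and five defenders (coordinates in feet).
shooter = (25.0, 5.0)
defenders = [(22.0, 9.0), (40.0, 25.0), (10.0, 30.0), (5.0, 25.0), (30.0, 40.0)]
print(closest_defender_distance(shooter, defenders))
```

In practice this would be evaluated on the frame closest to the shot release, once per shot.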

We then train a gradient boosting model to predict the shot success probability of a given shot. Although this can be useful on its own, it does not directly provide the relative importance of each of the input features.

To this end, we use interpretable machine learning techniques, specifically Shapley values. Using TreeSHAP, we determine the importance scores for each input feature, per shot. Aggregating these values over all games in our dataset, we estimate the relative importance of each feature.

## Model

Our preliminary goal is to devise a method to statistically determine the probability of a shot being made. We use XGBoost to model shot probability. Based on a hyperparameter search, we use the following hyperparameters: learning_rate=0.05, max_depth=3, n_estimators=100, base_score=0.45, colsample_bytree=1, subsample=0.8, gamma=0. Our chosen booster is ‘gbtree’.
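For reference, the hyperparameters above can be written as a plain dictionary. This sketch assumes the scikit-learn style `XGBClassifier` wrapper was used, which is not stated in the text; `X_train` and `y_train` in the comments are placeholders.

```python
# Hyperparameters quoted in the text, spelled with the keyword names that
# xgboost.XGBClassifier accepts (assumption: the sklearn-style wrapper is used).
xgb_params = {
    "booster": "gbtree",
    "learning_rate": 0.05,
    "max_depth": 3,
    "n_estimators": 100,
    "base_score": 0.45,
    "colsample_bytree": 1,
    "subsample": 0.8,
    "gamma": 0,
}

# With xgboost installed, the model could then be built roughly like this:
#   from xgboost import XGBClassifier
#   model = XGBClassifier(**xgb_params)
#   model.fit(X_train, y_train)  # placeholders for the shot features and make/miss labels
print(xgb_params)
```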

Although our model’s predictive power isn’t extremely strong (AU-ROC=0.56, AU-PRC=0.43), we still perform better than if we only used 3PT% to make predictions. The league average 3PT% was 0.35, so a random estimator would have an AU-PRC of 0.35. We specifically want to deduce what the model is learning within this improvement above 0.35.
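The claim that a random estimator has an AU-PRC close to the positive rate can be checked with a small simulation. The sketch below uses average precision as the estimate of AU-PRC (a common choice, assumed here) on synthetic labels with 35% makes; none of the data comes from the project.

```python
import random

def average_precision(labels, scores):
    """Average precision: mean of precision@k taken at the rank k of each positive."""
    order = sorted(range(len(labels)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

rng = random.Random(0)
labels = [1] * 3500 + [0] * 6500         # synthetic: 35% makes, like the league-average 3PT%
scores = [rng.random() for _ in labels]  # a "random estimator": scores carry no signal
print(round(average_precision(labels, scores), 3))
```

The printed value should land near 0.35, which is why 0.35 is the natural baseline against which the model's 0.43 is judged.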

## Interpretation

We are then able to interpret the model’s predictions. Specifically, we wish to concretely determine which features the model finds to be the most useful to predict the shot probability.

To this end, we use Shapley values (an idea from cooperative game theory): a concrete way of “splitting up” contributions among features. Shapley values can be positive or negative, signifying whether a specific feature pushed the model’s prediction up or down. The larger the magnitude of the Shapley value for a given feature, the more the model’s prediction was affected by that feature.

In order to compute our Shapley values, we use TreeSHAP. For each datapoint (a single shot), we are able to extract the Shapley values for the aforementioned features fed into the model: ‘Distance Between Shooter And Defender’, ‘Shot Distance’, ‘Difference Between Shooter And Defender Height’, and ‘3PT%’.
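To make the idea of “splitting up” a prediction concrete, here is a brute-force sketch that computes exact Shapley values for one shot by averaging marginal contributions over every ordering of the four features. It is a simplified single-baseline variant for illustration only: TreeSHAP obtains Shapley values far more efficiently for tree ensembles, and `toy_model` is a made-up stand-in, not the trained XGBoost model.

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley values via all feature orderings.

    Features that have not been "added" yet are held at their baseline value.
    """
    names = list(x)
    phi = dict.fromkeys(names, 0.0)
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        prev = predict(current)
        for name in order:
            current[name] = x[name]
            new = predict(current)
            phi[name] += new - prev  # marginal contribution of this feature
            prev = new
    return {name: total / len(orderings) for name, total in phi.items()}

def toy_model(f):
    """Hypothetical shot-probability function (illustrative coefficients)."""
    p = 0.35 + 0.02 * (f["defender_dist"] - 4) + 0.5 * (f["three_pt_pct"] - 0.35)
    # Interaction term, so the order in which features are added matters.
    p -= 0.001 * f["defender_dist"] * (f["shot_dist"] - 24)
    return p

shot = {"defender_dist": 8.0, "shot_dist": 26.0, "height_diff": 2.0, "three_pt_pct": 0.40}
average_shot = {"defender_dist": 4.0, "shot_dist": 24.0, "height_diff": 0.0, "three_pt_pct": 0.35}
phi = shapley_values(toy_model, shot, average_shot)
print(phi)
```

The values always sum to the difference between the prediction for this shot and the prediction for the baseline shot, and a feature the model ignores (here `height_diff`) receives exactly zero.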

In the above plot, we see the average Shapley value for all of the data points. Specifically, the distance between a shooter and their defender is more important than the 3PT%, the Shot Distance, and the Difference Between Shooter And Defender Height. In addition, the difference between a shooter’s height and a defender’s height has little to no significance when determining the probability of a made 3PT shot. Finally, the shot distance on a 3PT shot seems to be less significant than the Distance Between Shooter And Defender and the shooter’s 3PT%.
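A ranking like the one in the plot can be produced by averaging per-shot Shapley values for each feature. The sketch below assumes the common convention of averaging absolute values, so that positive and negative contributions do not cancel; the three rows of numbers are invented for illustration.

```python
def mean_abs_importance(shap_rows):
    """Average the absolute per-shot Shapley values for each feature."""
    features = list(shap_rows[0].keys())
    n = len(shap_rows)
    return {f: sum(abs(row[f]) for row in shap_rows) / n for f in features}

# Made-up per-shot Shapley values for three shots (illustrative numbers only).
shap_rows = [
    {"defender_dist": 0.04, "three_pt_pct": 0.02, "shot_dist": -0.01, "height_diff": 0.00},
    {"defender_dist": -0.03, "three_pt_pct": 0.01, "shot_dist": 0.01, "height_diff": 0.00},
    {"defender_dist": 0.05, "three_pt_pct": -0.03, "shot_dist": -0.01, "height_diff": 0.01},
]
importance = mean_abs_importance(shap_rows)
ranking = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
print(ranking)
```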

In the more detailed version of our Shapley value plot, we are able to pinpoint the trends for each of the features. For instance, in the 3PT% plot, we notice that the higher the 3PT%, the higher the Shapley value. Although this specific information is fairly intuitive, it serves as a sanity check that our model actually learned something sensible. Similarly, we can examine the distribution of Shapley values. For instance, there is not much variance in the Shapley values for ‘Difference Between Shooter And Defender Height’, while there is significant variance in those for ‘Distance Between Shooter And Defender’.

## Discussion/Conclusion

We believe that our approach can help coaches adjust their strategies, optimizing for specific shooter situations. In our own research, we hope to calculate Shapley values for specific players. For instance, if we can determine the associated Shapley values for a given player on defense, the summation of all of these values across the season can bring us closer to a unified defensive statistic. This can be aggregated not only over a season, but over specific games as well, allowing us to answer questions like: “How well did Anthony Davis play on defense during game 6 of the finals?” It can also help isolate offensive achievement.

Beyond this research, we hope that our methods can show that Shapley values are a field worth exploring in sports. Whether discussing the relative importance of specific attributes on a shot, or discussing lineups as a whole, we believe that the calculation of Shapley values can help us understand the relative importance of features. Our approach is similar to that of Matt Ploenzke’s submission to the Big Data Bowl.

More broadly, we hope that our method can show the benefits of interpretable machine learning methods in general. Machine-learning methods are often considered black boxes, but we believe that concepts like Shapley values can help decipher them. This can help us understand how these models learn, allowing us to better understand sports as a whole. Possible improvements on this research include improving model performance and comparing our shot probability model to existing shot probability models.

# Where do assists come from? (Part 2)

A few weeks ago, I did some analysis with archived SportVU Player Tracking data (2015-16), looking at where on the court assists come from. You can read about that analysis at these links:

Here, I’m taking a deeper dive into this data, looking at assists off misses and comparisons by position (Guards, Forwards, Centers).

In addition, you might notice that the overall distribution of assists differs slightly from my previous tweets. This is because I used a more robust method of discovering assists from the raw SportVU logs. Previously, I used play-by-play data to find a general timeframe for when assists occurred. Now, I determine the exact shot release time using calculus (smoothing the ball’s height above the ground, then taking its derivative with respect to time) and then backtrack to find the last pass before the shot. This led to a significant increase in the number of assists in my dataset (around 2x as many as before), which changed the overall distribution.
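The smoothing-and-derivative idea can be sketched as follows. The post does not spell out the exact release criterion, so this version makes an assumption: the release is taken to be the first frame where the smoothed upward speed of the ball exceeds a threshold. The window size, the `min_speed` threshold, and the synthetic height trace are all hypothetical.

```python
def moving_average(values, window=5):
    """Centered moving average; the window shrinks near the edges."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def vertical_velocity(heights, fps=25):
    """Central-difference derivative of height with respect to time (ft/s)."""
    dt = 1.0 / fps
    return [(heights[i + 1] - heights[i - 1]) / (2 * dt) for i in range(1, len(heights) - 1)]

def estimate_release_frame(ball_heights, fps=25, min_speed=8.0):
    """Index of the first frame whose smoothed upward speed exceeds min_speed.

    min_speed is a hypothetical tuning value, not taken from the post.
    Returns None when no frame qualifies.
    """
    smooth = moving_average(ball_heights)
    velocity = vertical_velocity(smooth, fps)
    for frame, v in enumerate(velocity, start=1):  # velocity[0] belongs to frame 1
        if v > min_speed:
            return frame
    return None

# Synthetic trace at 25 Hz: ball held at 7 ft for 30 frames, then launched upward.
held = [7.0] * 30
flight = [7.0 + 20.0 * t - 16.1 * t * t for t in (k / 25 for k in range(1, 26))]
print(estimate_release_frame(held + flight))
```

Once a release frame is known, the last pass can be found by walking backwards through the tracking frames from that point.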

### Passes by shooter location

Here, we see a grid of shooter location and whether the shot was made or missed.

You can see that the distribution of assists for misses is quite similar to that of makes.

As we often see when we watch film, the corner-3, kick-out assist is widely popular in the league. Here’s an example.

This is an example of what makes the kick-out assist on the corner-3 so effective. Cody Martin drives in, attracting 3 defenders leaving PJ Washington wide open in the corner.

#### Splitting up by position

In general, from these graphs, it seems like Guards make assists farther from the hoop than Forwards and Centers, which is somewhat expected (based on offensive positioning bias).

However, I found it interesting that Guards pass more often from behind the 3 point line than Forwards.

This Jamal Murray assist clip is a good example of an action that guards perform more often than forwards (on above-the-break 3’s).

Here we see Jamal Murray getting double-teamed, Jokic setting a screen, eventually leading to Jamal Murray making a pass from behind the 3 point line. Will Barton fakes out Willie Cauley-Stein, eventually leading to an open look from 3.

Similarly, this clip highlights Forwards’ tendencies well.

And here we see a slightly different above the break 3 assist. PJ Washington makes a pass that resembles a kick-out corner-3 assist to Jalen McDaniels. Jalen McDaniels makes a cut toward the top of the key attracting two Miami defenders, leading to an open above-the-break 3 for Caleb Martin.

### New Player Graphs

As I stated above, my new way of determining assists from the raw SportVU data is more robust. As a result, there is more data, which gets us closer to the actual distribution from the season. I am still not confident in stating that these are the true distributions because (a) not all games from 2015-16 were in my dataset of raw logs and (b) I am still unable to extract all assists from the logs that I have. The players shown below all have 80+ assists mapped for both made and missed shots. The original heat maps I plotted had about 50-60+ assists per map. I will also be releasing my post-processed CSV file with all of the mapped assists at the end of this post.

But you can see that we need a lot of data to actually make real conclusions. The individual player graphs could be a little bit more helpful if I had more data, but for right now I’d mainly trust the league-wide and position specific graphs.

As we saw from my graphs from my original tweet, the overall distributions of some players seem to have changed a bit from before. These are definitely getting closer to the true distribution of where assists come from.

(Updated) Below each of the graphs, I have also added the distributions of assists from my tweet https://twitter.com/avyvar/status/1267189790388056064 for comparison purposes.

#### Ricky Rubio

Current Distribution (more data, more accurate)

Old Distribution (less data)

#### Stephen Curry

New Distribution (more data, more accurate)

Old Distribution (less data)

#### Chris Paul

New Distribution (more data, more accurate)

Old Distribution (less data)

#### LeBron James

New Distribution (more data, more accurate)

Old Distribution (less data)

## Conclusion

For one, the distributions of assists on made shots and on missed shots are quite similar (league-wide). When we look at individual players, the distributions seem to differ. But again, I would refrain from making any judgement about individual players based on this data.

In addition, there does seem to be variation between assist heat maps for different NBA positions.

One topic that interests me is defensive positioning on made/missed and assisted/not assisted shots by shot location (will be writing another post about this one soon, there are some interesting results).

If anyone has any suggestions my twitter is @avyvar and my email is avyayv@gmail.com.

Here is a repo containing the assists (in CSV format) https://github.com/avyayv/mappedassists. Let me know if you’d like me to clarify anything.

## Acknowledgments

Thanks to Dean Oliver (@DeanO_Lytics), Todd Whitehead (@CrumpledJumper), and Patrick McFarlane (@py_ball_) for their feedback on this project. A lot of this post was based on their suggestions.

I found the videos for each of the plays using 3ball.io.

# Playing With Win Probability Models

I recently developed a win probability model for the awesome py_ball package in Python. The package itself makes NBA/WNBA data accessible to a wide audience. If you haven’t seen it, you should definitely check it out. The link is https://github.com/basketballrelativity/py_ball.

In this blog post, I’ll describe the methods I used to develop the model.

### Methods

Our model relies heavily on a series of logistic regressions, which depend on (a) the amount of time remaining in the game, (b) the point differential, and (c) who has possession. As of right now, the only bias we introduce at the beginning of the game is home court advantage, which is why the home team always has slightly better odds than the away team. This is because we feed everything into the model with respect to the home team, so the model learns that the home team has a slight advantage. We are hoping to add betting odds to find true pre-game win probabilities.

In order to develop the model, we use a method that Brian Burke used in his win probability models, splitting up the game into multiple groups.

We split the game up into 960 groups (one group every 3 seconds), where we run a separate logistic regression each. Each logistic regression takes in the point differential and who has possession. We do not need to explicitly input the time, because each model is only trained on a specific timeframe.

For games that go into overtime, we treat the 5 minutes of the overtime period as if they were the last 5 minutes of the fourth quarter. This is to ensure that there are enough training samples for the model to actually learn something. For instance, there are very few games that go into 4OT, so a logistic regression model would not be able to recognize any trends from so little data.
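The bucketing scheme described above can be sketched in a few lines: 48 minutes of regulation divided into 3-second groups gives 960 buckets, and overtime clocks are folded onto the end of the fourth quarter. The function names, the 0/1 encoding of possession, and the coefficient tuple are assumptions for illustration and are not the py_ball implementation.

```python
import math

SECONDS_PER_QUARTER = 12 * 60
REGULATION_SECONDS = 4 * SECONDS_PER_QUARTER        # 2880
BUCKET_SECONDS = 3
NUM_BUCKETS = REGULATION_SECONDS // BUCKET_SECONDS  # 960

def time_bucket(period, seconds_left_in_period):
    """Map a game clock reading to one of the 960 three-second groups.

    Overtime periods are treated as the same amount of time left in Q4.
    """
    if period > 4:
        period = 4
    elapsed = (period - 1) * SECONDS_PER_QUARTER + (SECONDS_PER_QUARTER - seconds_left_in_period)
    return min(NUM_BUCKETS - 1, int(elapsed // BUCKET_SECONDS))

def home_win_probability(coefs, point_diff, home_has_ball):
    """One bucket's logistic regression: sigmoid(b0 + b1 * diff + b2 * possession)."""
    b0, b1, b2 = coefs
    z = b0 + b1 * point_diff + b2 * (1 if home_has_ball else 0)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for a single bucket (not fitted values).
example_coefs = (0.1, 0.15, 0.05)
bucket = time_bucket(period=5, seconds_left_in_period=120)
print(bucket, home_win_probability(example_coefs, point_diff=3, home_has_ball=True))
```

A fitted model would hold one coefficient tuple per bucket and look up the right one with `time_bucket` before calling `home_win_probability`.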

The model is trained on 5 seasons worth of data from 2013-14 to 2017-18 games.

### Results

We evaluated our model on the 2018-19 data, using the Brier score.

The Brier score is the mean squared error between the predicted win probability and the actual outcome, averaged over every time frame. For instance, if the model predicts a 0.58 probability of winning at a given time and that team won at the end of the game, we add (1-0.58)^2 to the Brier score. We add all of these values for the entire game and divide by the total number of events. There is one event every 3 seconds.
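The calculation is short enough to write out directly. This is a minimal sketch of the standard Brier score formula; the function name and sample inputs are illustrative.

```python
def brier_score(predicted_probs, outcomes):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(predicted_probs, outcomes)) / len(outcomes)

print(brier_score([0.58], [1]))         # the single-event example: (1 - 0.58)^2
print(brier_score([0.5, 0.5], [1, 0]))  # a model that always says 50/50
```

A model that always predicts 0.5 scores 0.25, so lower values indicate predictions that carry real information.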

We received a Brier score of 0.167 for our model. This is a fairly decent value: lower is better, and a model that always predicted a 50% win probability would score 0.25.

### Comparison

The following examples are comparisons between our model (top) and inpredictable.com’s model.

Kobe Bryant’s last game (LAL vs. UTA 2015-16)

### Usage

The model will be available at https://github.com/basketballrelativity/py_ball. Example notebook using the win probability model is here https://github.com/avyayv/winprobability/blob/master/pyballpackage.ipynb.

# Where do assists come from? (Part 1)

I recently tweeted some assist heat maps that were generated using 2015-16 SportVU data here.

Although the individual player heat maps are interesting, I wanted to look at more league-wide trends. I also wanted to explain my methods a little bit more.

### Why?

The reason why I found this specific problem interesting was because of its potential implications.

Players in the NBA and all of basketball have inherent bias for where they prefer shots. For instance, if a player like Ben Simmons were standing at the 3-point line, you wouldn’t guard him as tightly as you would Stephen Curry. Essentially, you could adjust coaching strategy if you better understand player tendencies.

Analysis of the specific locations of which players prefer to pass and shoot could prove useful, as it would help players anticipate what could happen next in a specific play. This would eventually improve defensive strategy for teams.

### Methods

As I stated above, I used the 2015-16 SportVU data for the generation of these graphs. This data captures every single player on the court and the ball ~25 times every second. Although it would have been optimal to have multiple seasons’ worth of data, I was unable to find it.

I cross-reference this SportVU data with play-by-play data from stats.nba.com to determine when assists occur.

Then, using an approximate timing of the assist event from the play-by-play data, I record the location of both the passer and the shooter. This entire process is done with pandas and Python.

After this, I use a KDE (kernel density estimator) from Seaborn to generate the heat map plot. The KDE turns our sparse data into more of a continuous spectrum for better visualization.
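To show what the KDE step does to a handful of points, here is a from-scratch 2D Gaussian kernel density estimate. It is a stand-in for illustration, not Seaborn's implementation; the bandwidth and the sample passer locations are made up.

```python
import math

def gaussian_kde_2d(points, bandwidth):
    """Return a function estimating density at (x, y) from 2D points.

    Uses an isotropic Gaussian kernel with the given bandwidth.
    """
    norm = 1.0 / (len(points) * 2.0 * math.pi * bandwidth ** 2)

    def density(x, y):
        total = 0.0
        for px, py in points:
            d2 = (x - px) ** 2 + (y - py) ** 2
            total += math.exp(-d2 / (2.0 * bandwidth ** 2))
        return norm * total

    return density

# Hypothetical passer locations (feet): four near the top of the key, one in a corner.
passes = [(0.0, 25.0), (1.5, 24.0), (-2.0, 26.0), (0.5, 23.5), (-22.0, 3.0)]
density = gaussian_kde_2d(passes, bandwidth=3.0)
print(density(0.0, 25.0), density(-22.0, 3.0), density(20.0, 40.0))
```

Evaluating `density` on a grid covering the half court gives the smooth surface that the heat map colors represent.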

The visualization is heavily based on the post at http://savvastjortjoglou.com/nba-shot-sharts.html on visualizing NBA shot charts.

### Overall League

As stated in the caption, the left image is a heat map of all the assist locations in my limited dataset. The right image is a visualization of just the shots off of those assists.

Right off the bat, there seems to be more variation in the locations in which players take shots than where they pass from. This makes sense, as point guards are typically the ones assisting the ball, most of whom stay around the top of the key.

Clearly, based on the dataset, “drive-and-kick” assists aren’t as common as the normal, top of the key assist.

Further, as expected, we see that players standing in the corner are more likely to shoot than make a pass.

### How good is the generalization?

The above charts show that the generalization above does not capture the variability per player.

Even among players who play the same position (PG), such as Eric Bledsoe, Stephen Curry, and John Wall, there are stark differences between their individual assist charts.

It was interesting to me that Steph has a tendency to assist from the right side of the court, while John Wall has a tendency to do so from the left side. However, with the limited size of the dataset I was working with, it’s possible that the data I was given does not capture the full picture.

What can we do with this information? Well, we clearly see that Eric Bledsoe is more likely to pass when he’s in the paint versus when he’s at the three-point line. If a coach is able to adjust his strategy of how to defend a player like Bledsoe, it would likely improve that team’s overall defensive numbers.

To me, LeBron specifically was especially interesting. It seems as though, of all of the stars with significant assist numbers, LeBron has the most unpredictable assist locations.

This is one of the aspects of LeBron’s game that makes him such a difficult player to defend. Not only can he shoot and pass, but he can do both of these actions pretty much anywhere on the court.

Not only that, but teams also have stark differences overall. Above are 3 different teams, the Warriors, Cavaliers, and Grizzlies. All of these teams have vastly different ways of play, and as a result have different locations in which they pass.

### Future Work

In the future, I want to be able to apply these similar types of visualizations to other seasons’ tracking data. As stated here, it would be interesting to see how players/teams change over time.

I also think it might be interesting to run a sort of clustering algorithm on this data combined with shot chart data, to identify types of players.

If you have any suggestions on what else I could do with this information, please let me know through email (avyayv@gmail.com) or Twitter (@avyvar)

# Elam Ending Analytics

With the NBA season being postponed, there has been a lack of basketball in the world. As a result, I thought it would be interesting to look in depth at whether the Elam Ending has a place in the current NBA and how it would work.

### What is the Elam Ending?

If you didn’t watch the All-Star Game in 2020, the Elam Ending is an idea where each team at the start of a period has a target score rather than fighting against the clock. Rather than having a 5 minute overtime or a 12 minute fourth quarter, each team would have to score a certain number of points, based on the higher score in the game.

For instance, if Team X had 75 points and Team Y had 70 points at the end of the third quarter, the target score would be some number of points above team X’s score. In the All-Star Game, this number of points was 24, in memoriam of Kobe Bryant. If 24 was used in this hypothetical game, the target score would be 99, and the first team to reach 99 would win the game.
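The target score rule from this example is a one-liner. The function name is just for illustration.

```python
def elam_target(score_a, score_b, points_to_add):
    """Target score: the leading team's score plus a fixed number of points."""
    return max(score_a, score_b) + points_to_add

print(elam_target(75, 70, 24))  # 99, as in the example above
```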

A more in-depth description of the Elam Ending can be found here.

### Overtime

Applying the Elam Ending to overtime in the NBA has been widely suggested by NBA fans. In fact, Daryl Morey, the Houston Rockets’ GM, supports the implementation of the Elam Ending as well. It seems like a perfect, non-intrusive way of applying the idea to the NBA. As a result, we will investigate how the Elam Ending in overtime would work in today’s NBA.

#### How many points till the target score?

Teams have scored, on average, 10.3 points per overtime period in the league from 2011-12 to 2019-20.

The above graph describes the year by year points in overtime for teams. It is evident that the number of points being scored in overtime is increasing each year, due to the rise of three point shooting and efficient basketball. As a result, if we would like to maintain roughly the same amount of game time, we should have the target score be 11 points from the score in regulation.

#### How would win probabilities change?

For the purposes of this article, I will be using the following probability values to examine how the Elam Ending would change things. Thanks to Mike Beuoy (@inpredict on Twitter) for providing these values so I didn’t need to find them myself. In addition, for my comparisons to the timed overtime period, I use http://stats.inpredictable.com/nba/wpCalc.php.

This graph gives the frequencies of points scored on a given possession. During the 2019-20 season, teams scored zero points on a possession 50.5% of the time, 1 point on a possession 3.1% of the time, etc.

I wanted to examine this in depth, so I started from the beginning of a play with the jump ball. I tried to answer the question: how much does the jump ball affect the outcome of the game?

Using the probability distribution described above, I ran 1 million simulations of an overtime period, going up to 11 points.
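A simulation along these lines can be sketched as below. Only the 0-point and 1-point rates come from the post; the 2-, 3- and 4-point rates are placeholders chosen so the probabilities sum to 1. The sketch also assumes strictly alternating possessions with the jump-ball winner going first, which is a simplification, and it runs far fewer games than the 1 million mentioned above.

```python
import random

# Points-per-possession distribution. The 0- and 1-point rates are quoted in
# the post; the 2-, 3- and 4-point rates are PLACEHOLDERS, not real data.
POINT_PROBS = {0: 0.505, 1: 0.031, 2: 0.334, 3: 0.118, 4: 0.012}

def simulate_elam_overtime(rng, target=11):
    """Play one overtime to `target` points with alternating possessions.

    Team 0 won the jump ball and has the first possession.
    Returns the index of the winning team.
    """
    outcomes, weights = zip(*POINT_PROBS.items())
    scores = [0, 0]
    team = 0
    while True:
        scores[team] += rng.choices(outcomes, weights=weights)[0]
        if scores[team] >= target:
            return team
        team = 1 - team

def jump_winner_win_rate(n_games, seed=0):
    """Fraction of simulated overtimes won by the team that won the jump ball."""
    rng = random.Random(seed)
    wins = sum(simulate_elam_overtime(rng) == 0 for _ in range(n_games))
    return wins / n_games

print(jump_winner_win_rate(20_000))
```

Because the placeholder rates differ from the real distribution, the printed win rate will not exactly reproduce the figures reported in this post.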

The team that won the jump won the game ~54.4% of the time, while the team that lost the jump won the game ~45.6% of the time. This implies a ~4.4% advantage in winning the game when your team wins the jump ball.

Comparatively, the probability of winning the game after winning the jump ball in a 5-minute overtime period is ~0.542, negligibly lower than the Elam Ending probability. As a result, it does not seem that the importance of the jump ball changes with the implementation of the Elam Ending.

The next step is to see how the probability of winning a game in the Elam Ending compares to the probability of winning a game in regular overtime.

The first thing to realize is that the win probability in normal overtime is a function of the score differential and the amount of time left in the game. In comparison, the win probability in the Elam Ending is a function of each team’s score and the number of points to the target score.

A major difference that we realize is that as the time approaches 0 in a normal overtime period, the probability of winning a game approaches 1 or 0 with few exceptions. In contrast, with the Elam Ending, the probability of winning the game does not approach 1 or 0, as the team score is not a continuous variable like time is.

This nature is, in part, what makes the Elam Ending so exciting. It makes the losing team always feel like they have a chance, which leads to good play throughout the overtime period.

For instance, if there are twenty seconds left in a game and it is a four point game in a regular overtime period, the game turns into a free-throw shooting game, which very likely leads to the leading team winning the game. This is also not an exciting game to watch.

Rather, if the score is 6-10 in an overtime period with the Elam Ending, there exists a higher probability that the losing team wins the game. This makes the game far more fun to watch and it prevents intentional fouls.

Below are two graphs highlighting the win probability of a team leading by 4. The x-axis on the normal OT graph is the amount of time left in the game, while the x-axis on the Elam Ending OT graph is the game score.

It is evident from this graph that the win probability for the leading team with the Elam Ending does not increase monotonically.

This can be explained through the following example.

When it is 6-10 vs 5-9 in the period, the losing team is closer to the target score than before. For the winning team, if they take a field goal, they still have the same probability of making it as before. As a result, the winning team does not gain any advantage, while the losing team gains an advantage.

In addition, it is evident that the win probability never approaches 1 in the Elam Ending. This means the game is harder to predict, again, making it more fun to watch.

#### What types of shots should you take?

I also wanted to look at what types of shots the winning teams were taking when they won the game. Obviously the winning team will score more points than the losing team, but what areas of the game was the winning team exploiting?

Below is a comparative bar chart which highlights how many 1 point possessions each team had, how many 2 point possessions each team had, etc.

Not surprisingly, the winning team was scoring more 3’s and 2’s than the other team. Based on the difference in the heights of the winning and losing bars in each of those categories, teams, on average, outscore their opponents ~3 points more on 2’s and ~3 points more on 3’s. As a result, based on the average shooting tendencies of an NBA team, 3’s and 2’s are equally important in the Elam Ending.

However, it has been proven time and time again that shooting 3’s usually generates more points per shot. As a result, it is obvious that making a high amount of 3’s could prove useful in any game. However, there is also a point where taking more threes is detrimental in the Elam Ending.

In this graph above, I assume that the 3 point shooting percentage is 35% and the 2 point shooting percentage is 50%. Based on these numbers, it seems that you should shoot 90% of your shots from 3-point range in order to maximize your win probability. Of course, that number does not mean very much, as in-game dynamics such as defense could drastically affect this value.
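For context, the expected points per shot under the same assumed percentages (35% on threes, 50% on twos) can be worked out directly. This only covers expected value, not win probability in a race to a target score; the function name is illustrative.

```python
def expected_points_per_shot(three_share, three_pct=0.35, two_pct=0.50):
    """Expected points per shot when a fraction `three_share` of shots are threes."""
    return three_share * 3 * three_pct + (1 - three_share) * 2 * two_pct

for share in (0.0, 0.5, 0.9, 1.0):
    print(share, round(expected_points_per_shot(share), 3))
```

A three is worth 1.05 expected points against 1.00 for a two, so expected points keep rising as the share of threes grows. This suggests that an optimum below 100% comes from how shot outcomes interact with the fixed target score rather than from expected points alone.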

### Conclusion

Although the Elam Ending is nontraditional when it comes to professional basketball, the implementation of the rule in the NBA would make games more exciting. It would introduce more randomness to the game, and have fans holding their breaths until the final shot.

# Clustering NBA Shot Charts (Part 2)

My previous blog post showed how cluster-able NBA shot charts were. I recently made a few improvements to the model and looked into things that I didn’t look into in the previous article.

A quick summary of that article is that I generated a 14 dimensional vector with shot frequencies for different locations on the court. Then I ran k-means clustering on this vector for each player over a season.

Most of the methodology is the same between the two, so please read the other article for more depth.

## Number of Clusters

In my previous iteration, I used 3 clusters. However, I generated a plot that aimed to find the optimal number of clusters. Using the ‘elbow-method’ for k-means clustering, I found that the optimal number of clusters was probably a bit more, around 5.
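The elbow method compares the within-cluster sum of squares (inertia) across different values of k and looks for the point where adding clusters stops helping much. The sketch below uses a small from-scratch k-means with deterministic farthest-point initialisation on toy 2D data; both the implementation and the data are stand-ins, not the 14-dimensional shot vectors or the library used for this post.

```python
import math

def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm with farthest-point initialisation.

    Returns (centroids, inertia), where inertia is the within-cluster
    sum of squared distances.
    """
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from its nearest existing centroid.
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    inertia = sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)
    return centroids, inertia

# Toy data: three tight groups of points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
inertias = [kmeans(points, k)[1] for k in range(1, 5)]
print([round(v, 2) for v in inertias])
```

On this toy data the inertia drops sharply up to k=3 and barely improves afterwards, which is the “elbow” one looks for in the plot.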

## Clustering Results

After running the clustering algorithm, these were 5 example shot charts for each cluster.

Since we added more clusters, I interpreted what each of these clusters meant.

Cluster 0 seems to represent players who mainly shoot in the paint, but can shoot outside the paint. They don’t shoot many threes. My assumption is that these players used to be traditional big men but are in the process of becoming stretch forwards.

Cluster 1 seems to represent players who shoot threes and shots in the paint (Moreyball ideals). However, they seem to shoot more threes than paint shots.

Cluster 2 seems to represent players who prefer to shoot midrange shots.

Cluster 3 seems to represent players who play in the paint and leave the paint extremely rarely.

Cluster 4 seems to represent players who shoot threes and shots in the paint (Moreyball ideals). However, they seem to shoot more paint shots than threes.

These are some of the notable players from each of the clusters. Interestingly, LeBron James and Joel Embiid are in the same cluster. Obviously they are not the same type of player, but their shooting tendencies are quite similar. This is why adding something like assist data could be beneficial to the performance of this model.

I was curious so I looked at the Rockets’ distribution of clusters for 2018-19 and this is what I got.

In comparison, this is what the Knicks were.

This highlights that the Rockets really rely on Moreyball a lot (fitting :D), and mainly focus on the three-point aspect of their strategy. Further, the Knicks distribution shows that the Knicks aren’t that progressive in their methods (we knew that).

I then cross referenced the cluster with some statistics to see which clusters relied on the ball a bit more.

These two charts show how the midrange cluster tends to have more opportunity than other clusters. Personally, I believe this has to do with the close correlation between people who shoot from midrange and their reliance on isolation basketball. Players like Kevin Durant, Jimmy Butler, and Carmelo Anthony all fall into this cluster and they are known for playing isolation basketball.

I also cross referenced the clusters with some player statistics, like three point percentage and field goal percentage.

These two graphs help us see that cluster 1 shoots a lot of threes, as they have a higher three point percentage than all the other clusters, but a lower field goal percentage. Further, we can confirm that cluster 3 is the “traditional big man” and is full of extremely poor three point shooters.

Interestingly, cluster 2 and cluster 4 have similar three point and field goal percentages. However, cluster 2 shoots fewer threes and more mid-range jumpers, which, in general, is less efficient. This is highlighted with EFG% below.

Cluster 0 is also quite poor at shooting, but they still venture out of the paint more often than cluster 3. When we watch Giannis and Anthony Davis play, we can easily identify this, as we know they are trying to expand their game to the three point shot. However, they are not that efficient from the three point line at the moment.

These graphs also further confirm that midrange players are the least efficient shooters in terms of EFG% and that traditional big men (or merely players who don’t deviate much from the paint) are the most efficient in this sense.

## Cluster Distribution Over Time

In the previous blog post, I generated different clusters for each of these years. However, I thought it would be interesting to use the same clusters and see how the distribution of the clusters changed over time.

We see that cluster 2 used to be the most popular for many years. However, with the rise of Moreyball and efficiency, we see that clusters 1 and 4 have become more popular in recent years.

The distribution of the clusters, interestingly, did not change much from 1999-00 to 2008-09. Over this entire timeframe, the number of midrange players decreased slightly, but it is not noticeable. Only recently do we see this complete change in the distribution of clusters.

## Future Work

I want to see if I can correlate these clusters to win percentage in some way. This way, we can see what clusters directly translate to winning. I also want to add other mapped data (such as where assists were made from, where rebounds were taken) and see if this helps better cluster players.

You can view all of the players in each of the clusters here https://docs.google.com/spreadsheets/d/1OphZnMi5a0vYPI_QZ1q8mRANT68oocZVQeKJIAK6nv4/edit?usp=shar

# Clustering NBA Shot Charts (Part 1)

## Methodology

In the NBA, we often assign labels to players without looking in depth at what constitutes these labels. One way to figure out the “definition” of these labels, and to see whether they actually exist, is to use an algorithm known as k-means clustering to cluster shot charts (that is, to find similar shot charts given a set of features).

My approach for clustering the shot charts was to bin groups of shots, much like we sometimes do for visualization. Binning means that each player is represented by a vector recording the shot frequency at individual locations, like so.

I separated shots into 14 locations as given by the stats.nba.com API, and I created a 14×1 vector per player for each season, containing the shot frequency for each location on the court. The locations are highlighted in the shot chart above. I do not include field goal percentage because I was trying to highlight the tendencies of the player, and FG% is irrelevant to that in my opinion.

I can’t use the actual raw X-Y coordinates because players take a different number of shots per game, which would make the dimensions of the vector different for every player. This would prevent the usage of k-means clustering on the data.
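To illustrate the binning, here is a minimal sketch that turns a player's shots into a fixed-length frequency vector. It assumes each shot has already been mapped to a zone index from 0 to 13. The function name and the toy inputs are hypothetical, and the actual zone labels from the stats.nba.com API are not reproduced here:

```python
NUM_ZONES = 14  # the post bins shots into 14 court locations


def shot_frequency_vector(zone_ids, num_zones=NUM_ZONES):
    """Turn a list of per-shot zone indices into a fixed-length frequency vector."""
    counts = [0] * num_zones
    for zone in zone_ids:
        if not 0 <= zone < num_zones:
            raise ValueError(f"zone index out of range: {zone}")
        counts[zone] += 1

    total = len(zone_ids)
    if total == 0:
        return [0.0] * num_zones  # no shots: all-zero vector
    return [count / total for count in counts]


# Hypothetical players with very different shot volumes.
low_volume = shot_frequency_vector([0, 0, 13, 5])
high_volume = shot_frequency_vector([0] * 300 + [5] * 100 + [13] * 200)
print(len(low_volume), len(high_volume))
```

Both vectors have the same length regardless of how many shots each player took, which is what k-means needs.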

I ran the clustering algorithm, with the steps highlighted above, for two separate time frames, to see how the clusters have changed over time. The two time frames I selected were the “2016-17”, “2017-18”, “2018-19” (recent) seasons and the “1999-00”, “2000-01”, “2001-02” (old) seasons.

## Results

I settled on 3 clusters, though that number was arbitrary. I can definitely try a larger number of clusters and see where that takes me.
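For readers unfamiliar with k-means, the following is a minimal pure-Python sketch of Lloyd's algorithm. It is not the code used for this post (that is linked at the end). It assumes a deterministic farthest-first initialization to keep the example reproducible, and the 3-element toy vectors (paint, mid-range, three) stand in for the real 14-element ones:

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def farthest_first_centroids(vectors, k):
    """Start from the first vector, then repeatedly add the farthest remaining one."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(vectors, k, iterations=100):
    """Lloyd's algorithm: alternate assignment and centroid updates until stable."""
    centroids = farthest_first_centroids(vectors, k)
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_vector(members)
    return labels, centroids


# Hypothetical frequency vectors: [paint, mid-range, three].
toy_vectors = [
    [0.90, 0.10, 0.00], [0.80, 0.15, 0.05],  # paint-heavy
    [0.20, 0.70, 0.10], [0.25, 0.65, 0.10],  # mid-range-heavy
    [0.20, 0.10, 0.70], [0.15, 0.15, 0.70],  # three-heavy
]
labels, centroids = kmeans(toy_vectors, k=3)
print(labels)
```

The cluster numbers themselves are arbitrary; only which players share a label is meaningful.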

I first ran UMAP dimensionality reduction and highlighted different clusters, just to verify that there was something to highlight.

It’s obviously not easy to make any conclusions from this UMAP visualization alone, so I took some samples from all of the clusters highlighted by the algorithm.

Above, each row represents one cluster highlighted by the algorithm. The first row is obviously a cluster that highlights players that do not deviate from the paint much. It includes players like Dwight Howard and Ben Simmons.

However, the other two clusters that the algorithm highlighted (2 and 3) seem extremely similar. Personally, I don’t see any stark differences between them, but in general, it seems like the second cluster is more inclined to “Moreyball”, meaning players in the second cluster take fewer mid-range shots than players in the third cluster. However, the difference seems very subtle, so I’m not really sure.

These are the relative amounts of each cluster in the overall dataset. It makes sense, as the number of players who only play in the paint is very low.

Here, the first row highlighted seems to be players who exemplify the “perimeter game”. This makes sense as the perimeter game was very prominent in the seasons we’re looking at.

The second cluster seems to highlight players who mainly rely on the mid-range game and don’t really venture much into three-point range. The third cluster seems to use the mid-range game but also ventures out to the three-point line. The distinction between these two isn’t too eye-catching.

These are the relative frequencies of each cluster in the dataset. The mid-range game was quite prominent during this age, and the algorithm seems to agree.

## Conclusion

Really, the only cluster that seems to exist in both eras of basketball is the cluster with mid-range and three-point shooters. This really speaks to the quickly changing nature of basketball. The perimeter two is not being used much at all, nor is the pure mid-range game. This is clearly the result of analytics in the sport, as these shots just don’t provide as many points per shot taken.

There are definitely things that I can do better in this project. If you have any suggestions, I can definitely try implementing them.

All the code is at https://github.com/avyayv/blogposts/blob/master/clustershotcharts/

Thanks to Savvas Tjortjoglou for his code for outlining the NBA court in matplotlib.

# How useful (or useless) are preseason statistics for rookies?

Zion Williamson has been phenomenal this preseason for the New Orleans Pelicans. This has led to various opinions in the basketball world on how Zion will perform in the regular season.

Some say that Zion is going to be an All-Star in his rookie season. In fact, Stephen A. Smith made the bold claim that Zion’s rookie season will mirror Shaquille O’Neal’s, based on Zion’s unparalleled efficiency in the paint.

To examine this idea more objectively, I looked at the general case of rookies in the preseason and the regular season and isolated some key statistics. One thing to note that affects the interpretation of these results:

Teams have extreme variability of schedules in the preseason. For instance, one team could be playing a strong Lakers team each game, while another team plays teams like the Shanghai Sharks. Thus, it is tough to generalize anything from preseason to regular season. However, for the purposes of this article, we will assume that this will not affect player/team statistics by a wide margin.

I looked at three different basic statistics (points per game, assists per game, rebounds per game) in the regular season vs. the preseason. I plotted these values on separate histograms for regular season and preseason statistics.

These stats show that the overall distribution of points per game for rookies is more right-skewed in the regular season than in the preseason.

To examine this further, I wanted to see whether players were less efficient during the regular season, or whether a main cause was simply less opportunity during the regular season.

In this scatter plot, the size of the dot corresponds to the ratio of regular season playing time to preseason playing time. Mathematically, it is simply expressed as

$\frac{MPG_{\text{regular season}}}{MPG_{\text{preseason}}} \cdot k$

The constant $k$ is there only so that one can distinguish between the sizes of the dots.
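As a small illustration of the sizing rule, here is a sketch of the computation. The function name and the value `k=50.0` are assumptions chosen for the example; the post does not state which constant was used:

```python
def dot_size(mpg_regular, mpg_preseason, k=50.0):
    """Marker size proportional to the regular season / preseason minutes ratio."""
    if mpg_preseason <= 0:
        raise ValueError("preseason minutes per game must be positive")
    return k * mpg_regular / mpg_preseason


# Hypothetical rookies: one whose minutes were halved, one whose minutes grew.
print(dot_size(10.0, 20.0))  # 25.0
print(dot_size(30.0, 20.0))  # 75.0
```

A player whose minutes shrink in the regular season gets a smaller dot than one whose minutes grow.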

If you look closely at this plot, there is a clump of small dots below the line (the line that represents the values staying the same). This means that most of these players received far fewer minutes during the regular season, which in turn skews the points per game histogram further to the right.

Thus, we can assume that rookies maintain fairly similar point averages across the preseason and the regular season. We can also generalize this to assists, with the same types of plots.

We can see from this plot that regular season assist averages are also more skewed right than preseason assist averages.

And again, we see the same trend as we did with points. However, the data is far more clumped near zero, which makes sense: most players do not have the ball in their hands enough to facilitate much.

When we look at rebounds, however, we can see that the distributions remain fairly similar across the regular season and the preseason.

When we look at the scatter plot for rebounds, the number of points above the y=x line seems about the same as the number below it.

There is obviously some correlation between preseason and regular season stats. The main bottleneck for rookies, which causes regular season statistics to dip, is that they do not get as many opportunities in the regular season as they do in the preseason.

However, we should expect Zion to continue to be the main man for the New Orleans Pelicans, and with his body frame and ability to score at will, don’t expect too much of a drop off in regular season statistics.

# A Quick 3 or a Quick 2?

Nearly every day in the NBA (playoffs included), there are close games that come down to the wire. Quite often we see teams facing 3, 4, or 5-point deficits with only about a shot clock’s worth of time remaining, and one of the questions commentators always ask in this situation is:

Do you go for the quick 2 and intentionally foul (and hope that the opponent will miss a free throw) or go for a three?

A lot of the time (we saw this in the Houston vs Golden State series), teams decide to go for the quick two, but other times they go for the three (as in the Golden State vs Portland series). Although it was a 3-point game in the latter case, versus a 5-point deficit in the former, there should be some simple analytical way of determining when a team should go for the 3 or the 2.

Golden State is known for its shooting. Although we mainly think of Golden State as a great 3-point shooting team, GSW is full of great free-throw shooters as well. In the clutch, Steph shoots 97 percent from the line, which is one of the best marks in the league. In addition, Klay shot 100 percent from the line in the clutch this season, while Durant shot 91 percent. This means that the Warriors have at least 3 great options (out of 5) to give the ball to when they have a small lead with little time left.

Given all of these options, we’ll assume that the free throw percentage for the Warriors when they are intentionally fouled is 90% (which is obviously conservative).
In addition, we will assume that the trailing team’s average 2 point percentage in the clutch is 55 percent, while its average 3 point percentage in the clutch is 30 percent.

Now assuming it is a 3 point game and the trailing team has the ball, we have a few scenarios:

1. Shoot a 3 -> tie the game
2. Shoot a 2 -> intentionally foul -> opposing team misses a free throw -> Shoot a 2
3. Shoot a 2 -> intentionally foul -> opposing team misses both free throws -> Shoot a 2
4. Shoot a 2 -> intentionally foul -> opposing team makes both free throws -> Shoot a 3

In all of these cases, the trailing team catches up, but is it more likely to do so against the Warriors by shooting a 3 on that play or by shooting a 2 on that play?

Well, when you compute the probabilities using some simple multiplication, you get that the probability of catching up when you shoot:

A THREE = 31.52%
A TWO = 19.11%

Clearly, you should go for the three, no matter what Mark Jackson and Stan Van Gundy said.
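The "simple multiplication" for the quick-two path can be sketched as follows. This assumes the two free throws are independent, so "misses a free throw" means exactly one miss out of two, and it multiplies through scenarios 2 to 4 as listed. The function names are hypothetical. Under these assumptions the result lands on the 19.11% quoted above, which suggests (but does not confirm) that the same arithmetic was used. The 31.52% figure for the three is not reproduced here, since a single three at 30% would tie only 30% of the time and the extra paths are not spelled out:

```python
def free_throw_outcomes(ft_pct):
    """Probabilities of making both, exactly one, or neither of two free throws."""
    make_both = ft_pct ** 2
    miss_one = 2 * ft_pct * (1 - ft_pct)
    miss_both = (1 - ft_pct) ** 2
    return make_both, miss_one, miss_both


def quick_two_when_down_three(ft_pct=0.90, two_pct=0.55, three_pct=0.30):
    """Chance of catching up via: make a two, foul, then make the follow-up shot."""
    make_both, miss_one, miss_both = free_throw_outcomes(ft_pct)
    follow_up = (
        miss_one * two_pct        # lead is 2: a two ties (scenario 2)
        + miss_both * two_pct     # lead is 1: scenario 3 still lists a two
        + make_both * three_pct   # lead is 3: a three ties (scenario 4)
    )
    return two_pct * follow_up


print(f"{quick_two_when_down_three():.4f}")
```

Changing `ft_pct` shows how strongly the quick-two path depends on the leading team missing at the line.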

However, this is only the Warriors case. What if the leading team shoots free throws poorly? We can represent this as a graph, where the independent variable is the probability that the team with the lead makes a free throw and the dependent variable is the probability of tying the game. (We are going to assume that 3PT% = 30% and 2PT% = 55% for all teams.)

In the above graph, the green line represents taking a three while the white line represents taking a two. Clearly, teams should always shoot the 3 when they are down 3 and have the ball.

Now let’s see what happens if it’s a 4 point game:

There are more possibilities if it’s a 4 point game, but we’ll remove some of them when generating the model (the ones that are essentially negligible because they have such a low probability of occurring).

1. Shoot a 3 -> intentionally foul -> opposing team misses a free throw -> Shoot a 2
2. Shoot a 3 -> intentionally foul -> opposing team misses both free throws -> Shoot a 2
3. Shoot a 3 -> intentionally foul -> opposing team makes both free throws -> Shoot a 3
4. Shoot a 2 -> intentionally foul -> opposing team misses a free throw -> Shoot a 3
5. Shoot a 2 -> intentionally foul -> opposing team misses both free throws -> Shoot a 2
6. Shoot a 2 -> intentionally foul -> opposing team makes both free throws -> Shoot a 3 ……

When you are playing against the Warriors, the probabilities are like so:

A THREE = 10.43%
A TWO = 5.47%

Again, we see that the probability of tying is higher when you go for a three than when you go for a two. Again, we represent this with a graph, with the same X and Y variables.

Interestingly, when you are down 4, it is better to go for the two if the opposing team shoots free throws poorly (< 79%). However, if the opposing team has a high FT%, you should always go for the 3. At ~79% FT%, it does not matter whether you go for the 2 or the 3.
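As a check on the Warriors numbers for the 4 point game, the three-first path (scenarios 1 to 3) can be worked out by hand. This assumes the two free throws are independent, so with 90% shooting the chance of exactly one miss is $2 \times 0.9 \times 0.1 = 0.18$, of two misses $0.01$, and of two makes $0.81$:

$$
P(\text{three first}) = 0.30 \times \left( 0.18 \times 0.55 + 0.01 \times 0.55 + 0.81 \times 0.30 \right) = 0.30 \times 0.3475 = 0.10425
$$

That is consistent, up to rounding, with the 10.43% quoted for the three. The 5.47% for the two depends on how the truncated scenario 6 continues, so it is not reconstructed here.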

Finally, we’ll look at the 5 point game case (the hopeless cause, basically).

At this point, you can basically accept the loss if you are playing against the Warriors. You need basically everything to line up in your favor (missed free throws, made threes, made twos). More concretely, you need one of the following:

1. Shoot a 3 -> intentionally foul -> opposing team misses a free throw -> Shoot a 3
2. Shoot a 3 -> intentionally foul -> opposing team misses both free throws -> Shoot a 2
3. Shoot a 3 -> intentionally foul -> opposing team makes both free throws -> Take the L, basically
4. Shoot a 2 -> intentionally foul -> opposing team misses a free throw -> Take the L, basically
5. Shoot a 2 -> intentionally foul -> opposing team misses both free throws -> Shoot a 3
6. Shoot a 2 -> intentionally foul -> opposing team makes both free throws -> Take the L, basically ……

Against the Warriors:

A THREE = 1.59%
A TWO = 0.647%

With that probability of tying up the game, you should just take the loss 😞. Below is the same graph as above, but for a five-point game.

CONCLUSION: There are different situations when teams should go for the 3 and the 2, but when it’s a 3 point game, ALWAYS go for the 3. Also, when you play the Warriors, you should always go for the 3. We should expect at least 2 close games this year during the finals, with the Raptors on the losing side. If the Raptors can play by this strategy, they should be able to win at least one of these games.

# Playmaking in the Playoffs vs. the Regular Season

The 2019 NBA Playoffs have been excellent, with teams playing at their absolute best. We’ve seen teams like the Warriors and the Bucks absolutely dominate, but how have these teams, along with other teams, changed their playmaking strategies? For instance, if we look at the Bucks in the Playoffs, they have obviously decided to have Giannis drive into the paint more and pass out less. This is because Giannis’s shots in the paint generate more points per shot (field goal percentage multiplied by the shot’s value, a three or a two) than a three-point shooter typically would.

However, the Bucks have obviously employed a different strategy than other teams in the playoffs. For instance, we know that the Denver Nuggets’ whole strategy depended on Nikola Jokic. However, it isn’t his extraordinary ability to score that makes Jokic so great. Instead, it is his ability to make plays and get his teammates points that allows Jokic’s team to win games. Thus, we saw the Nuggets embrace this strategy.

I expressed this idea of teams looking for playmaking with a simple statistic (Points from Assists adjusted for usage rate inflation). This stat basically allowed me to see:

1. the quality of the passes (if they didn’t pass well, the passes wouldn’t translate into points)
2. the number of passes (if they didn’t pass often, they wouldn’t generate points)

Then, I plotted these players on a graph to see how (in general) teams change their strategies with regard to passing and playmaking.

In the graph above, the red line represents a player whose points from assists does not change in the playoffs. However, we see that the line of best fit here has a slope that is slightly lower than 1 (actually it is about 0.79 with an R^2 of 0.65). This shows that teams in today’s league are becoming more focused on playing through a star player in the playoffs, as we have seen with the effects of superstars in the league.

Although we saw (in my previous post) that teams typically give important players the same usage rate in the playoffs as in the regular season, we see here that teams typically have their players shoot more and pass less. Instead of getting points off of assists like they did in the regular season, they find it more beneficial to score through their star players.

However, outliers in this dataset remain. Steph Curry has actually created more points from assists in the playoffs than he did in the regular season, which fits the style of basketball the Warriors play.
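For reference, a slope and R^2 like the ones quoted for the line of best fit come from an ordinary least-squares fit of playoff values against regular season values. A minimal sketch, assuming simple linear regression with an intercept; the function name and the sample numbers are placeholders, not the data from the graph:

```python
def least_squares_fit(xs, ys):
    """Return (slope, intercept, r_squared) for the line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)  # squared Pearson correlation
    return slope, intercept, r_squared


# Placeholder values (hypothetical): points from assists per player.
regular_season = [4.0, 7.5, 10.0, 12.5, 15.0]
playoffs = [3.5, 6.0, 9.5, 9.0, 13.0]
slope, intercept, r_squared = least_squares_fit(regular_season, playoffs)
print(slope, intercept, r_squared)
```

A slope below 1 means that players with high regular season values tend to fall below the "no change" line in the playoffs.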