The Magnus Prediction Models

September 18, 2018, Micah Blake McCurdy, @IneffectiveMath

Estimating Shooter and Goalie Talent

I am interested in isolating which NHL players shoot the puck well, and which NHL goaltenders do a good job at preventing shots from becoming goals. To that end I have fit a regression model which replicates some of the simple features of shooting and saving. Throughout this article, when I say "shot" I will mean "unblocked shot", that is, goals, saves, and misses (including shots that hit the post or the crossbar). Furthermore, when I talk of shooting talent, I mean the ability to score more than one would expect given the shot location, so a player may well take a lot of shots from great scoring locations and still be "a bad shooter" in some sense. Generating many such shots is obviously desirable and surely can be done more often by talented players, but I do not consider any such talents to be part of shooting talent, which is (half of) the subject of this article.

Throughout, I'll be using only 5v5 shots, since I think the hockey assumptions underlying the model are only valid for a single strength state. However, one could presumably fit such a model (with perhaps slightly different tuning parameters) for 5v4 and even for 5v3, and then obtain aggregate estimates for players by combining their estimates from the various different models.

Method

Once a shot is being taken by a given player from a certain spot against a specific goaltender, I estimate the probability that such a shot will be a goal. This process is modelled with a generalized ridge logistic regression; for a detailed exposition please see Section 3. Briefly: I use a design matrix for which every row is a shot with the following columns:

I make a slightly unusual modification to shot distances; namely, shots which are recorded as coming from closer than ten feet are assigned a distance of 10ft. This is to stop small variations in shot location from having outsize effects on the regression, and also because it is close to the threshold of minimum human reaction time for goaltenders given typical NHL wrist shot speeds.

The observation is 1 for goals and 0 for saves or misses. The model is fit by maximizing its likelihood; that is, for a given model, form the product of the predicted probabilities for all of the events that did happen (90% chance of a save here times 15% chance of that goal there, etc.). Large products are awkward, so we solve the mathematically equivalent problem of maximizing the logarithm of the likelihood, and before we do so we add a term of the form \(-\beta^T\Lambda\beta\), where we use \(\Lambda\) to encode our prior knowledge, as described below.
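Written out, the quantity being maximized can be sketched as follows. The notation is an assumption made for this sketch: \(X_i\) denotes the i'th row of the design matrix, \(Y_i\) the observation for shot i, and \(p_i = (1 + \exp(-X_i\beta))^{-1}\) the predicted goal probability for that shot. $$ \sum_i \left[ Y_i \log p_i + (1 - Y_i) \log(1 - p_i) \right] - \beta^T\Lambda\beta $$ The sum is the logarithm of the likelihood (each shot contributes \(\log p_i\) if it was a goal and \(\log(1 - p_i)\) otherwise), and the final term is the penalty which pulls the estimates towards zero.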

Simple formulas for the \(\beta\) which maximizes this likelihood do not seem to exist, but we can still find it by iteratively computing: $$ \beta_{n+1} = ( X^TX + \Lambda )^{-1} X^T ( X \beta_n + Y - f(X,\beta_n) ) $$ where \(f(X,\beta)\) is the vector function whose entry at position i is \((1 + \exp(-X_i\beta))^{-1}\), where \(X_i\) is the i'th row of \(X\) (this choice of \(f\) is what makes the regression logistic). By starting with \(\beta_0\) as the zero vector and iterating until convergence, I obtain estimates of shooter ability and goaltending ability, with suitable modifications for shot location and type.
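The iteration can be sketched in a few lines of NumPy. This is a minimal illustration on made-up data, not the code used to produce the results in this article: the function names, the toy design matrix (a constant, a scaled distance, and indicator columns for two hypothetical shooters), the simulated outcomes, the tolerance, and the iteration cap are all assumptions. The penalty values 100 and 0.001 are the ones quoted in this article for player and non-player columns.

```python
import numpy as np

def logistic(z):
    """Entrywise logistic function, playing the role of f(X, beta)."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_ridge_logistic(X, y, lam_diag, tol=1e-10, max_iter=5000):
    """Iterate beta <- (X'X + Lambda)^{-1} X'(X beta + y - f(X, beta)).

    lam_diag holds the diagonal of Lambda. The stopping rule (largest
    change in any coefficient below tol) is an assumption of this sketch.
    """
    A = X.T @ X + np.diag(lam_diag)      # fixed across iterations
    beta = np.zeros(X.shape[1])          # beta_0 is the zero vector
    for _ in range(max_iter):
        target = X @ beta + y - logistic(X @ beta)
        new_beta = np.linalg.solve(A, X.T @ target)
        if np.max(np.abs(new_beta - beta)) < tol:
            return new_beta
        beta = new_beta
    return beta

# Toy data (entirely hypothetical): 600 shots split between two shooters.
rng = np.random.default_rng(0)
n = 600
distance = rng.uniform(0.1, 1.0, n)      # already scaled, as in the model
shooter = rng.integers(0, 2, n)          # 0 or 1
X = np.column_stack([
    np.ones(n),                          # constant
    distance,                            # distance covariate
    (shooter == 0).astype(float),        # indicator: hypothetical shooter A
    (shooter == 1).astype(float),        # indicator: hypothetical shooter B
])
true_beta = np.array([-1.0, -1.5, 0.4, -0.4])   # used only to simulate goals
y = (rng.uniform(size=n) < logistic(X @ true_beta)).astype(float)

# Heavy penalty on player columns, nearly none on the others.
lam_diag = np.array([0.001, 0.001, 100.0, 100.0])
beta_hat = fit_ridge_logistic(X, y, lam_diag)
print(beta_hat)
```

At a fixed point of this iteration, \(\Lambda\beta = X^T(Y - f(X,\beta))\), which is a quick check that the loop has converged. Because \(X^TX + \Lambda\) does not change between iterations, it only needs to be formed once; a larger problem would presumably factor it once and reuse the factorization.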

This model is zero-biased, which is to say that we consider deviations from average ability to be on-their-face unlikely and bias our results towards average. Another way of saying the same thing is to say that we are beginning with an assumption (of a certain strength) that all players are of league average ability and then letting the observed data slowly update our knowledge, instead of beginning with an assumption that we know nothing about the shooters and goaltenders at all. The bias is controlled by the matrix \(\Lambda\), which must be positive definite so that the matrix inverse in the above formula is well defined and the resulting \(\beta\) is the one which minimizes the total error. As in my 5v5 shot rate model, I use a diagonal matrix, where the entries corresponding to goaltenders and shooters are \(\lambda = 100\) and those corresponding to all other columns are 0.001, that is, very close to zero. As in that model, the non-trivial \(\lambda\) values were chosen by varying \(\lambda\) and choosing a value where player estimates have stabilized.

In the future, I will publish results for all seasons, but for now, I record the results of fitting this model on all of the 5v5 shots in the 2016-2018 regular seasons. First, the non-player covariates are:

Covariate         Value
Constant          -2.55
Slapshot          +0.0836
Tip/Deflection    -0.222
Backhand          -0.175
Wraparound        -0.300
Rush              +0.228
Rebound           +0.754
Distance          -2.86
Visible Net       +1.15

Logistic regression coefficient values can be difficult to interpret, but negative values always mean "less likely to become a goal" and positive values mean "more likely to become a goal". To compute the probability that a shot with a given description will become a goal, add up all of the model terms which apply to that shot to obtain a number, and then apply the logistic function to it, that is, $$ x \mapsto \frac{1}{1 + \exp(-x)}$$ This function (after which the regression type is named) is very convenient for modelling probabilities, since it is monotonically increasing, takes the midpoint of the number line (that is, zero) to 50%, takes very negative numbers to probabilities close to zero, and takes very large positive numbers to probabilities close to one.

Thus, for instance, we might want to compute the goal probability of a wrist shot from 30 feet out (just below the tops of the circles), on the split line, neither on the rush nor a rebound. To do this, begin with the constant value -2.55. We have encoded distance by dividing by 89, so we multiply 30/89 times the distance coefficient of -2.86 to obtain -0.964. From the split line, the visible net is 1, so we add +1.15. Wrist and snap shots are taken as the base category, so no shot type term needs to be added. Since the shot is neither a rush shot nor a rebound, we have all the terms we need; adding them together gives -2.364. Applying the logistic function gives 8.6%, close to the historical percentage of six to eight percent from this area.
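The same calculation can be sketched in Python. The coefficients are copied from the table of non-player covariates and the 10ft floor on distance is the one described in the Method section; the function name, argument names, and dictionary keys are hypothetical, and the shooter and goaltender terms are left out, which amounts to assuming a league-average shooter and goaltender.

```python
import math

# Coefficients from the table of non-player covariates.
CONSTANT = -2.55
DISTANCE = -2.86        # applied to distance in feet divided by 89
VISIBLE_NET = 1.15
RUSH = 0.228
REBOUND = 0.754

# Wrist and snap shots are the base category, so they contribute nothing.
SHOT_TYPE = {
    "wrist": 0.0,
    "snap": 0.0,
    "slapshot": 0.0836,
    "tip_deflection": -0.222,
    "backhand": -0.175,
    "wraparound": -0.300,
}

def goal_probability(distance_ft, visible_net, shot_type="wrist",
                     rush=False, rebound=False):
    """Goal probability for an average shooter against an average goalie."""
    distance_ft = max(distance_ft, 10.0)   # shots closer than 10ft count as 10ft
    x = CONSTANT
    x += DISTANCE * distance_ft / 89.0
    x += VISIBLE_NET * visible_net
    x += SHOT_TYPE[shot_type]
    if rush:
        x += RUSH
    if rebound:
        x += REBOUND
    return 1.0 / (1.0 + math.exp(-x))      # logistic function

# The worked example: a 30ft wrist shot from the split line.
p = goal_probability(30, visible_net=1.0)
print(f"{p:.1%}")   # 8.6%
```

Passing `rebound=True` or a different `shot_type` shows how each coefficient moves the probability while everything else about the shot is held fixed.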

The overall features of the model are more or less as expected: shots from farther away are less likely to go in, seeing more of the net is good, rush shots are good, and rebound shots are even better. The (very slight) positive value for slapshots and the negative value for tips and deflections may seem surprising at first; after all, slapshots score only rarely and tips score often. However, slapshots are systematically taken far from the net, and tips and deflections almost always from close to the net. After accounting for shot location there is almost no difference between wrist shots and slapshots, while tips are, after all, somewhat less precise than wrist shots, and the player tipping the prior shot generally isn't looking at the net.

Player Results

As the above example shows, the model can already be used without specifying shooters or goaltenders. However, this is perhaps a little boring. Below are the values for all the goaltenders who faced at least one shot in the 2016-2018 regular seasons. I've inverted the scale so that the better performances are at the top.

The scale is the same units as for the non-player covariates above, so even the best or worst performances are smaller than the effect of a shot being a rush shot, for instance, consistent with goaltending performances being broadly similar across the league.

Similarly for forward and defender results, which I've put on separate pages for performance reasons.
