The Magnus Prediction Model, version 7

Estimating Individual Impact on NHL 5v5 Shot Rates

August 2, 2023, Micah Blake McCurdy, @IneffectiveMath
(This model, "Magnus 7", is an updated version of Magnus 6, released in the northern summer of 2022. In the interest of providing self-contained documentation here, I've copied the explanation there, with changes as appropriate.)

What's New

For old hands who want the quick rundown of what is new in this version: this summer I tested a number of promising ideas which turned out to make the prediction of future-season shot rates worse, so they have been discarded. In particular, I considered using pairs of seasons to estimate player ability; this improved my xG model but does not help here, so I've continued to use single seasons.


I would like to be able to isolate the individual impact of a given skater on shot rates from the impact of their teammates, their opponents, the scores at which they play, the zones in which they begin their shifts, the instructions of their head coach, their level of fatigue, how far away their bench is, and home-ice advantage. I have fit a regression model which provides such estimates. The most important feature of this model is that I use shot rate maps as the units of observation and thus also for the estimates themselves, permitting me to see not only what portion of a team's performance can be attributed to individual players but also detect patterns of ice usage.

Here, as throughout this article, "shot" means an event recorded by the NHL as either scored, saved, missed, or blocked. Since the league does not record shot locations for blocked shots, those locations are inferred from the recorded block location.

The most sophisticated element of Magnus is the method for estimating the marginal effect of a given player on shot rates; that is, that portion of what happens when they are on the ice that can be attributed to their individual play and not the context in which they are placed. We know that players are affected by their teammates, by their opponents, by the zones their coaches deploy them in, and by the prevailing score while they play. Thus, I try to isolate the impact of a given player on the shots which are taken and where they are taken from.

Although regression is more mathematically sophisticated than some other measures, it is in no way a "black box". As we shall see, every estimate can be broken down into its constituent pieces and scrutinized. If you are uneasy with the mathematical details but interested in the results, you should skip the "Method" section and just think of the method as like a souped-up "relative to team/teammate statistics", done properly.


I use a simple linear model of the form \( Y \sim X\beta \), fit with a weighting matrix \(W\), where \(X\) is the design matrix, \(Y\) is a vector of observations, and \(\beta\) is a vector of marginals, that is, the impacts associated to each thing, independent of one another. Each passage of 5v5 play with no substitutions is encoded in the model as two rows: one row with the home team as the attacking team, where the response entry in \(Y\) is the pattern of shots taken by the home team, and another row with the road team as the attacking team, where the response entry in \(Y\) is their pattern of shots. I call such passages of play with no substitutions "stints", by analogy with the same term used in basketball research, although the presence of on-the-fly changes in hockey means that some stints can be very short. Note also that a single stint can contain within it several stoppages of play, including a variety of faceoffs, so long as the players on the ice do not change. A typical NHL season contains about a quarter million stints and so our design matrix \(X\) has about half a million rows. Happily, as we shall see imminently, there are only about two thousand covariates, so \(X\) has about a billion entries, most of which are zero. This is computationally non-trivial but still tractable with a level head and some half-decent computers.


The columns of \(X\) correspond to all of the different features that I include in the model. There are, broadly, two different types of columns. Some terms occur in pairs, one for offence and one for defence:

Most of the remaining terms apply only to the offensive team:

Finally, there is one class of terms, new this year, that apply only to the defensive team: terms for measuring the tendency of individual goaltenders to give up rebounds. Goaltenders do not simply save shots or fail to do so; they can also stop play entirely (which usually, but not universally, causes the stint to end and a new one to start) or they can "allow" play to continue, possibly creating more shots to be taken against them. This impact on shot rates is of a different character than the way that skaters create or allow shots, and so here the coefficients are encoded in an unusual way. If the response for a given row has \(n\) shots in it, we set the "goalie rebound coefficient" in that row to \(n\) also. This pilfering of a scrap of information from the dependent variable to put into the design matrix is not quite orthodox, but I don't see a better way to handle rebounds at this time.


The entries in \(Y\), the "responses" of the regression, are functions which encode the rate at which shots are generated from various parts of the ice. A shot with NHL-recorded location of \((x,y)\) is encoded as a two-dimensional gaussian centred at that point with a width (standard deviation) of ten feet; this arbitrary figure is chosen because it is large enough to dominate the measurement error typically observed when comparing video observations with NHL-recorded locations and also produces suitably smooth estimates.
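A sketch of this encoding, including the later steps of dividing by stint length and subtracting the league-average rate (the grid here follows the hundred-by-hundred dissection described in the fitting section; the coordinate conventions and function names are my own assumptions):

```python
import numpy as np

# Each shot becomes a unit-volume 2-D gaussian of ten-foot width; the sum is
# divided by the stint length and the league-average rate is subtracted.
gx = np.linspace(-42.5, 42.5, 100)   # feet across the rink (assumed convention)
gy = np.linspace(0.0, 100.0, 100)    # feet from centre line to end boards
GX, GY = np.meshgrid(gx, gy)

def gaussian_bump(x0, y0, sigma=10.0):
    g = np.exp(-((GX - x0) ** 2 + (GY - y0) ** 2) / (2 * sigma ** 2))
    return g / (2 * np.pi * sigma ** 2)   # unit volume

def stint_response(shots, seconds, league_avg_rate=0.0):
    total = sum((gaussian_bump(x, y) for x, y in shots), np.zeros_like(GX))
    return total / seconds - league_avg_rate

# Two shots in a fifty-second stint (league average set to zero here):
resp = stint_response([(0.0, 80.0), (-10.0, 50.0)], 50.0)
```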

One detailed example (with no-longer current players for either team, but no matter) should make the structure of the model clear:

Suppose that the Senators are the home team and the Jets are the away team, and at a given moment of open play two Senators players (say, Karlsson and Phaneuf) jump on the ice, replacing two other Senators. This begins a new "stint". The score is 2-1 for Ottawa in the third period, and the other players (say, Pageau, Hoffman, and Stone for Ottawa, against Laine, Ehlers, Wheeler, Byfuglien, and Enstrom) began their shift fifteen seconds previously on a faceoff in Ottawa's zone. Play continues for 50 seconds, during which time Ottawa takes two shots from locations (0,80) and (-10,50) and Winnipeg takes no shots. This shift ends with the period and the players leave the ice. The game is the season-opener for Winnipeg, but Ottawa played their first game yesterday, using all of the named players except Phaneuf, who was scratched in that game for unspecified reasons, causing a great deal of unproductive twittering.

These fifty seconds are turned into two rows of \(X\) and two entries in \(Y\). First, the Ottawa players are considered the attackers, and the attacking columns for Pageau, Hoffman, Stone, Karlsson, and Phaneuf are all marked as "1". The Jets are considered the defenders and the defending columns for Laine, Ehlers, Wheeler, Byfuglien, and Enstrom are marked as "1". All of the other player columns, attacking or defending, are marked as "0". Because the Senators are winning 2-1, the score column for "leading by one" is marked with a "1" and the other score columns are marked as "0". Three of the Senators players are playing on-the-fly shifts, while the other two Senators skaters (that is, 40% of the skaters) are still playing the shift they began in their own zone, so the "home on-the-fly attacking" zone columns are marked with coefficients totalling 0.6, divided equally among the terms for seconds 1 through 50, and the "home defensive-zone attacking" zone columns are marked with coefficients totalling 0.4, divided equally among the terms for seconds 16 through 65. The zones that the Jets players began their shifts in are not considered at this time. The Senators are the home team, so the "home 3rd" column is marked with a 1. The column for "Guy Boucher, attacking coach" is marked with a one, while the column for "Paul Maurice, defending coach" is marked with a one. Additionally, since Ottawa is leading by one in the third period, the "Guy Boucher, shell offence" term is marked with a one; all the other coaching columns are marked with zeros. All of Winnipeg's players are marked as "well-rested". Phaneuf is well-rested for Ottawa, but the other four skaters played last night, so the column for "well rested, attackers" is marked with 0.2 and the "tired, attackers" column is marked with 0.8.
Corresponding to this row of \(X\), an entry of \(Y\) is constructed as follows: Two gaussians of ten-foot width and unit volume are placed at (0,80) and (-10,50) and the two gaussians are added to one another. This function is divided by fifty (since the stint is fifty seconds long); resulting in a continuous function that approximates the observed shot rates in the shift. Finally, I subtract the league average shot rate from this. This function, which associates to every point in the offensive half-rink a rate of shots produced in excess of league average from that location, is the "observation" I use in my model.

Second, the same fifty seconds are made into another observation where the Jets players are considered the attackers, and the Senators are the defenders. The attacking player columns for the Jets are set to 1, and the defending columns for the Senators players are set to 1. Since the Jets are losing, the score column for "trailing by one" is set to one. Since all of the Jets skaters are in the middle of an offensive-zone shift, the "OZ" terms are marked with coefficients that sum to 1, evenly distributed over the seconds 16 to 65; the Senators shift start locations are not considered here. The columns for "Guy Boucher, defending coach", "Guy Boucher, shell defence", and "Paul Maurice, attacking coach" are each marked with a one, and all the other coaching columns are marked with zeros. Since all of Winnipeg's players are well-rested, no rest columns are marked for offence. Phaneuf is well-rested for Ottawa, but the other four skaters played last night, so the column for "played last night, defenders" is marked with 0.8, since 80% of the defending skaters did so. All the other rest columns are marked with zeros. The two rows have no non-zero columns in common. Since the Jets didn't generate any shots, the associated function is the zero function; I subtract the league average shot rate from this and the result is placed in the observation vector \(Y\).

The weighting matrix \(W\), which is diagonal, is filled with the length of the stint in question. Thus, the above two rows will each have a weight of fifty.
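In code, the Ottawa-attacking row and the stint weights might be sketched like this (column labels are illustrative, not the model's actual ones, and the per-second zone terms are elided into a comment):

```python
# A condensed sketch of the worked example's Ottawa-attacking row.
ottawa_row = {
    **{f"att:{p}": 1.0 for p in ("Pageau", "Hoffman", "Stone", "Karlsson", "Phaneuf")},
    **{f"def:{p}": 1.0 for p in ("Laine", "Ehlers", "Wheeler", "Byfuglien", "Enstrom")},
    "score:leading_by_one": 1.0,
    "state:home_3rd": 1.0,
    "coach:Boucher_attacking": 1.0,
    "coach:Maurice_defending": 1.0,
    "coach:Boucher_shell_offence": 1.0,
    "rest:attackers_well_rested": 0.2,   # Phaneuf only
    "rest:attackers_tired": 0.8,         # the four who played last night
    # zone terms: 0.6 spread over the on-the-fly seconds, 0.4 over the DZ seconds
}
# The diagonal weighting matrix W holds the stint length for each row:
weights = {"ottawa_row": 50.0, "winnipeg_row": 50.0}
```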

By controlling for score, zone, teammates, and opponents in this way, I obtain estimates of each player's individual, isolated impact on shot generation and shot suppression.


Details of precisely how the model is fit can be found below, but I present the results first, knowing that they are of somewhat broader interest.

Game State

Every column in the regression corresponds to a map of shot rates over a half-rink. The simplest are perhaps the six "game state" terms:

Unsurprisingly, the home team has an advantage in each period, while the second period shows an uptick in shot danger for both teams. Perhaps more surprisingly, the third period shows a decrease in shots for both teams relative to the first period, even after accounting for score effects.

All of the maps depicted here are to be understood as relative to league average expected goals for the season in question. Red regions show more, and more dangerous, shots coming from that part of the ice than average, and blue regions show fewer and less dangerous shots than average. White regions see shot patterns that are roughly as dangerous as average. A full explanation of this expected goals model can be found here and the cleverness required to encode xG rates in pictures here.

For convenience, the xG rate of the term, relative to baseline 5v5 xG rate, is also shown in the neutral zone. So, displayed above, the effect on shots of being the home team in the second period is +16.6% xG/60, which means that simply being the home team (in addition to any benefit gained by matchups) is associated with generating a pattern of shots likely to result in 16.6% more goals per hour than league average, given average shooting and goaltending talent.


The six score terms are as follows:

As we know, trailing teams dominate play, even though they still usually lose. This "global" pattern (that is, across the entire league, isolated from the system effects of any coaches) is not quite linear. Tied states are associated with cagey, negative hockey, with fewer and less dangerous chances for both home and away teams. Teams leading by one take more shots off the rush, typically from the circles, and fewer shots from high in the zone. The opposite is true of teams trailing by one, who generate more shots from cycling.


The fatigue terms are as follows:

Here "00" means "did not play last night or the night before", what I've called "well-rested"; "01" means "played two nights ago but not last night", which is "normal" and is the most common rest pattern. Finally, "10" means "played last night but not the night before"; this is the "tired" state, which is associated with weaker results, as we expect. (In rare cases (trades, reschedulings) a player might play NHL games on three consecutive nights; these players are encoded as "10".) Being well-rested is less beneficial (and more common) than being tired is harmful.
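A minimal sketch of this encoding (the function name and argument order are mine; the first digit records last night, the second the night before):

```python
# Encode the three rest states described above.
def rest_state(played_last_night, played_night_before):
    if played_last_night:
        return "10"   # "tired"; also covers rare three-games-in-three-nights
    return "01" if played_night_before else "00"   # "normal" / "well-rested"
```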


For each of the four zones in which one might start a shift, there are sixty different terms for the first sixty seconds of such shifts.

Consolidating each map into a single xG impact, as a % of league average, allows us to summarize all 240 of them:

The pooling penalties permit us to compare all of the zone starts to one another directly. As expected, starting in the offensive zone helps boost shot rates considerably, but this benefit tails off steadily as such shifts progress.

Starting in your own zone, or in the neutral zone, depresses shot rates, as one might expect, but perhaps surprisingly, this effect does not steadily fade towards zero; instead spending time in such a shift generally makes things worse. Evidently one generates offence from such starts by first generating a new shift for most or all of one's players, and then generates offence in those stints. The similarity of the offence generated in DZ and NZ shifts shows that the formidable obstacle is not one's own blue line, but rather the opponent's blue line.

Most shifts, however, do not start with faceoffs in any zone, but instead by skaters jumping over the boards on the fly. Such shifts have somewhat elevated shot rates at first, but then over time become less productive. Most of the game is played by players in the first thirty or forty seconds of their shifts, so we see that the sensation of games with fewer whistles being more exciting is not an illusion caused by total running time being shorter; such games really do have more and more dangerous 5v5 shots.

The maps themselves for every second can be seen individually on the site,

or, if you prefer, in animated form. Note in particular how the early-seconds of an on-the-fly shift routinely have dump-in shots from the neutral zone.


The two penalty-residual terms are as follows:

The immediate aftermath of a penalty is, strictly speaking, 5v5 play, but it is extremely chaotic; when the skater returning to the ice from serving the penalty is a defender, it can be especially so. Usually this "residual" time is very short-lived, with the team coming off the power-play (who are usually the ones with the puck) urgently trying to force a play to the net. The shot rates for this "power-play residual" period are very, very high, even higher than for the bulk of power-plays themselves, where a little more patience is common. Somewhat more interesting is the fact that the "penalty-kill residual" (the same stretches of time, considered from the attacking point of view of the team which has just killed off a penalty) is also characterized by more dangerous rates of shots. The obvious play here is the penalized-skater-on-a-breakaway, but other opportunistic counterattacking rushes also boost this somewhat.


The peculiar encoding of the goaltender rebound terms means that each one should be interpreted as the rate of shots created (or suppressed) by a goaltender's tendency to generate rebounds, per shot. A properly sophisticated analysis would take into account how some shots are more likely to create rebounds than others, but that will have to wait for another day.

Although these terms improve the prediction accuracy of the shot rate model slightly, they are not a particularly stable skill, with a year-to-year correlation of only 0.10.


One obvious reason to construct such a model is to suss out the abilities of players, which is intrinsically interesting. However, we can also make comparisons about how much each of the various factors in the model affects play.

"Raw" on-ice results

Before I describe the model outputs, I turn first to the raw observed results from the 2022-2023 regular season.

This graph is constructed as follows: for every skater who played any 5v5 minutes, compute the xG created and allowed by their team while they were on the ice; this produces a point \((x,y)\). Then, form the density map of all such points, where each point is weighted by the corresponding number of minutes played by the player in question. For ease of interpretation, I've scaled the xG values by league average, so that a value of \((5,5)\) on the graph means "threatening to score 5% more than league average, and also threatening to be scored on 5% more than league average". As is my entrenched habit, the defensive axis is inverted, so results that are favourable to the skater in question appear in the top right (marked "GOOD" so that there can be no doubt). The contours are chosen so that ten percent of the point mass is in the darkest region, another ten percent in the next region, and so on. The final ten percent of the point mass is in the white region surrounding the shaded part of the graph. For convenience the weighted sum of the values is marked with a red dot; this dot represents the centre of the mass of the distribution, on which you could balance the whole shooting match on the tip of your steady fingertip. The black dot is the unweighted average.
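The two averages can be sketched as follows, with invented data standing in for the per-skater points and minutes:

```python
import numpy as np

# The red dot is the icetime-weighted mean of the per-skater
# (offence, defence) points; the black dot is the unweighted mean.
pts = np.array([[5.0, -3.0], [-2.0, 1.0], [0.0, 4.0]])   # xG% for / against, vs average
minutes = np.array([900.0, 300.0, 600.0])                # 5v5 icetime weights

red_dot = (pts * minutes[:, None]).sum(axis=0) / minutes.sum()
black_dot = pts.mean(axis=0)
```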

Player Marginals

Repeating the above process with the individual player marginal estimates gives the following graph in green. The overall shape is still broadly normal, but the offence is slightly right-skewed (as is to be expected, considering the fact that rookies are penalized as they are).

One guiding principle of mine is that I don't include any terms in the model itself that identify players by position, since I would like to be able to measure differences between positions. With that in mind, here is the same density as above, but only for forwards:


and again for defenders:


Not surprisingly, forwards generate somewhat more offence, on average. In both cases the weighted averages skew slightly to "good" compared to the unweighted averages, which is consistent with coaches giving more icetime to their better players.

Teammate Impact

Once I have individual estimates for players in hand, I can make many interesting secondary computations to show the distribution of various effects. Most obviously, for a given player, I can form the sum of the player estimates of all the given player's teammates, weighted by their shared icetime, and then multiply it by four (since every player has four teammates at 5v5). These estimates of teammate quality can then be graphed as above:
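A sketch of this computation with invented numbers (the analogous opponent-quality computation simply swaps in head-to-head icetime and a factor of five):

```python
import numpy as np

# Teammate quality for one player: the icetime-weighted average of the
# teammates' individual estimates, times four, since every skater has
# four teammates at 5v5.
teammate_estimates = np.array([3.0, -1.0, 0.5])    # isolated impacts, % of average
shared_icetime = np.array([400.0, 250.0, 100.0])   # minutes shared with the player

teammate_quality = 4.0 * (teammate_estimates * shared_icetime).sum() / shared_icetime.sum()
```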

Since I centred the plot of the player ability distribution, this one is also centred.

Opponent Impact

The same computation can be done for any given player's 5v5 opponents: form the sum of all of their isolates, weighted by head-to-head icetime, and then multiply by five, since every player has five opponents at 5v5. This is graphed below in red:

First, notice that the scale on the axes is smaller than for teammates; more discussion of that will follow later. The variation in competition quality is more pronounced offensively; the range of defensive ability faced by players is much smaller.

If we add together the effect of each player's teammates and the effect of their competition, we obtain a distribution that is roughly normal around zero, as expected.

Score Impact

Zone Impact

Coach Impact

Since coaches change and players change teams, and players do not play in every score state equally, not every player is affected by coaching systems the same amount. That said, there are only around thirty-five or forty head coaches in the league each year, so the distribution of coaching effects on players is lumpier.

The above shows the impact on players due to coaches: the general terms, the score terms, and the shell terms, as appropriate. To see the "overall" coach terms specifically, I've condensed them here:

The "shell" terms, which are applied additionally for teams tied or leading by one or two in the third period, are as follows:

As experience watching suggests, most coaches/players respond to late high-defensive-leverage situations by playing worse, especially worse defensively; although on a league-wide level this is much less pronounced than it once was.

To get some notion of "total" coach impact for each coach, I form the weighted sum of the overall system term with 0.2 times the "shell" term, since approximately 20% of the game is played in these states. This "synthetic" coach impact is as follows:
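As a minimal sketch of this arithmetic (the function name is mine):

```python
# The synthetic impact is the overall system term plus one fifth of the
# shell term, since roughly 20% of the game is played in shell states.
def synthetic_coach_impact(overall_term, shell_term):
    return overall_term + 0.2 * shell_term
```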


Strictly speaking, residuals for a regression refer to the differences between the individual observations (that is, each stint) and the predicted shot rates for that stint. As a perhaps more insightful alternative, for each player, I can compute the difference between their raw (observed) 5v5 on-ice results and what I would expect from computing the model outputs associated with the observed players and the zones and scores and home-ice advantage they played under. This is shown below:

This makes a sort of "goodness of fit" measure. The obviously circular shape is encouraging.

Relative Scale

In the above, I have shown each distribution on its own scale, so that I could discuss the shape of each one in turn. However, to understand their relative importance, they should all be placed on the same scale, which I have done below (except the residuals):

Here there are many interesting insights to be had:

Previous Work and Acknowledgements

Using zero-biased (also known as "regularized") regression in sports has a long history; the first application to hockey that I know of is the work of Brian MacDonald in 2012. His paper notes many earlier applications in basketball, for those who are curious about history; also I am very grateful for many useful conversations with Brian during the preparation of the first version of Magnus. Shortly after MacDonald's article followed a somewhat more ambitious effort from Michael Schuckers and James Curro, using a similar approach. Persons interested in the older history of such models will be delighted to read the extensive references in both of those papers.

More recently, regularized regression has been the foundation of WAR stats from both Emmanuel Perry and Josh and Luke Younggren, who publish their results at Corsica (now sadly defunct) and Evolving-Hockey, respectively.

Finally, I am very thankful to Luke Peristy for many helpful discussions, and to the generalized ridge regression lecture notes of Wessel N. van Wieringen, both of which were of immense value to me.

As far as I can tell, the extension of ridge regression to functions (that is, shot maps instead of just single numbers) is new, at least in this context.

Player and Coach Results

While my research has always been free to the public and will remain so, I restrict access to the specific regression outputs for players and coaches to website subscribers. To gain access, please subscribe at any level; then after logging in you should be able to find coaching terms, and individual player terms are on the career pages for each player.


To fit a simple model such as \(Y = X\beta\), weighted by \(W\), using ordinary least squares fitting is to find the \(\beta\) which minimizes the total error $$||Y-X\beta||_W^2$$ where \(||x||_W = \sqrt{x^TWx}\) is the "\(W\)-norm". Since \(X\beta\) is the vector of model-predicted results (where \(\beta\) is the set of model measurements and \(X\) is the circumstances we observe them in), the difference between \(X\beta\) and the observed results \(Y\) is a measure of error; forming the weighted product of \(Y-X\beta\) with itself squares this error (to ensure it is positive), and we then seek to minimize the total: hence the name "least squares".

When the entries of \(Y\) are numbers, this error expression is a one-by-one matrix which I naturally identify with the single number it contains, and I can find the \(\beta\) which minimizes it by the usual methods of matrix differentiation. To extend this framework to our situation, where the elements of \(Y\) are shot maps, I use a dissection of the half-rink into ten thousand pieces, as a hundred-by-hundred grid. This divides the rink up into parcels one foot long by 0.85 feet wide, sufficiently coarse to permit efficient computation and sufficiently fine to appear smooth when results are gathered together. In particular, since the input shot data is smoothed into sums of gaussians before the regression is fit, we can compute the regression as if it were ten thousand separate regressions whose outputs are combined to form the maps for each term. It might be helpful to imagine a video broadcasting system, where input video is spliced into channels, each channel modified by appropriate filters for the display media at hand, and then each channel organized into an output which viewers can perceive as a single object. The practical benefit of this is that I can use the well-known formula for the \(\beta\) which minimizes this error, namely $$ \beta = (X^TWX)^{-1}X^TWY $$ which makes it clear that the units of \(\beta\) are the same as those of \(Y\); that is, if I put shot rate maps in, I will get shot rate maps out.
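Concretely, once the responses are discretized onto the grid, \(Y\) becomes a matrix with one column per grid cell, and all ten thousand weighted regressions share a single normal-equations solve. A toy numpy sketch, with invented sizes far smaller than the real model's:

```python
import numpy as np

# With responses discretized onto G grid cells, Y is an (n, G) matrix and
# all G weighted regressions share one solve of the normal equations.
rng = np.random.default_rng(0)
n, p, G = 200, 5, 16
X = rng.normal(size=(n, p))                  # design matrix
W = np.diag(rng.uniform(1.0, 60.0, size=n))  # stint lengths as weights
beta_true = rng.normal(size=(p, G))          # one map (here, G cells) per term
Y = X @ beta_true                            # noiseless toy responses

beta_hat = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)  # (p, G): a map per term
```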

However, I choose not to fit this model with ordinary least squares, preferring instead to use generalized ridge regression; that is, instead of minimizing \( ||Y - X\beta||_W^2 \) as in ordinary least squares, I add two so-called ridge penalties, to instead minimize: $$ ||Y - X\beta||_W^2 + ||\beta-\beta_0||_\Lambda^2 + ||\beta||_K^2$$ We consider three separate ways to measure how wrong the measurements \(\beta\) are: the first term says that deviation from the given year's data is bad; the second term says that deviations from our prior estimates \(\beta_0\) are bad; and the third term says that deviations from zero (that is, from league average) are bad. The matrices \(W\), \(\Lambda\), and \(K\) control how bad specific deviations are to be considered. I call \(||\beta-\beta_0||_\Lambda^2\) the "historical penalty" and \(||\beta||_K^2\) the "structural penalty"; they are explained in detail below.
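Setting the gradient of this penalized objective to zero gives the closed-form minimizer \(\beta = (X^TWX + \Lambda + K)^{-1}(X^TWY + \Lambda\beta_0)\). A toy numpy sketch with invented data, verifying that the gradient vanishes at the solution:

```python
import numpy as np

# Generalized ridge: minimize ||Y - X b||_W^2 + ||b - b0||_Lam^2 + ||b||_K^2.
rng = np.random.default_rng(1)
n, p = 50, 4
X = rng.normal(size=(n, p))
W = np.diag(rng.uniform(1.0, 60.0, size=n))
Y = rng.normal(size=n)
beta_0 = rng.normal(size=p)                    # prior estimates (last season)
Lam = np.diag(rng.uniform(0.1, 1.0, size=p))   # historical penalty
K = np.diag(np.full(p, 0.5))                   # structural penalty

beta = np.linalg.solve(X.T @ W @ X + Lam + K, X.T @ W @ Y + Lam @ beta_0)

# The objective's gradient vanishes at the minimizer:
grad = -2 * X.T @ W @ (Y - X @ beta) + 2 * Lam @ (beta - beta_0) + 2 * K @ beta
```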

The historical penalty

Although the exposition here focusses on 2022-2023, the most-recent season of NHL hockey as I write this, in practice I fit this model successively, first on 2007-2008, the first season for which the NHL provides data at this level of detail, and then repeating the process for all subsequent seasons. Thus, after each season, I have a suite of estimates of player ability which I do not throw away. Since I am trying to estimate player ability (not performance), I take the opinion that our estimates ought to change slowly, since a player's athletic ability also usually changes slowly. Furthermore, the game of hockey itself also changes (that is, its rules change, and also teams in the aggregate choose their players and playstyles differently) but does so slowly. Thus, every term in the model is biased towards its value from the previous season. This is done by taking \(\beta_0\) to be the \(\beta\) from the previous season, and populating the diagonal elements of \(\Lambda\) itself with the estimated precisions from the previous season.

Every year, however, the league contains players who did not play in the previous season at all, for whom prior estimates cannot be found in this way. For these players I assume a prior "replacement level" ability of -7.5% of league baseline for offence and +1% of league baseline for defence. These numbers were chosen by using an earlier version of this model with a slightly tweaked version of the replacement method advocated by the Evolving Wild twins, Josh and Luke Younggren. Specifically, since the standard NHL roster includes 13 forwards and 7 defenders, there are, canonically, \(13n\) "regular" forwards and \(7n\) "regular" defenders in the league, where \(n\) is the number of teams in the league at any given time (\(n\) = 32 for 2022-2023, but it has been 31 and 30 within the dataset considered here). Thus, ranking all of the players at each position by total icetime, we can get a rough idea of "replacement players" as the ones lower in rank than \(13n\) or \(7n\), as appropriate. These replacement players should presumably be sufficiently young that they can't reasonably be expected to decline in quality over the course of a single season, so we can measure their average ability and then use this as the new replacement level, iterating the process until the input and output values for these players no longer differ substantially; arriving at the (-7.5%, +1%) figures mentioned above.
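The iteration can be sketched as a fixed-point search; the refit step below is a toy stand-in for the real model, not the model itself:

```python
# Fixed-point iteration for replacement level: choose a trial replacement
# prior, refit, measure the average estimate of the below-cutoff
# (sub-13n / sub-7n) players, and repeat until input and output agree.
def iterate_replacement(initial, refit, tol=1e-3, max_iter=100):
    level = initial
    for _ in range(max_iter):
        new_level = refit(level)   # avg estimate of the replacement cohort
        if abs(new_level - level) < tol:
            return new_level
        level = new_level
    return level

# Toy stand-in for the refit step, contracting toward a -7.5% fixed point:
level = iterate_replacement(0.0, lambda x: 0.5 * x - 3.75)
```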

It is not usually my habit to implicitly trust coaching staffs even to this very mild extent, but it seems unavoidable here.

In 2007-2008, with no prior year of data to guide me, I use this replacement level for every player.

The structural penalty

The structural penalty \(||\beta||_K^2\) is somewhat more technical and has several sub-components.

Player penalties

Very generally, even without knowing that the players are specific humans with histories, we also know that they are, by definition, NHL players. This knowledge includes, diffusely, a great deal of information: that they fall within a given age range, that they are almost certainly from certain countries and not others, that they have a track record of playing high-level hockey of one type or another, and so on. In short, they are selected members of a population about which we know, quantitatively, quite a bit before we watch any shifts of hockey played. Furthermore, the athletic abilities themselves which we are primarily interested in are constrained, very generally, by what we know about the possibilities of human performance, and ultimately by physics itself. The structural penalty allows us to encode our prior belief that all of the players are drawn from this population, here understood to be centred at league average (that is, median) shot rate. The size of the entries in the diagonal of \(K\) allows us considerable flexibility; small values encode a belief that the league has a broad range of talent, large values encode a belief that the talent level of the league as a whole is tightly clustered. For the players, we use a numerical value of ten thousand. I am not in a position to compute precisely what parameter would be optimal, but rough out-of-sample testing suggests it should be near this value.

(More involved theoretical estimates of optimal penalty values (such as computing the generalized cross-validation error, following Brian MacDonald (Section 5.3)) suggest much smaller values which give wildly varying (and hence unphysical) year-to-year estimates of player ability.)

A truly disciplined modeller would have used the league average from the previous year, in order to maintain faithful observance of using only prior data in specifying the prior, but I have cheated slightly and used the data from the season at hand instead. I crave the reader's forgiveness.

Players that we expect, before we see their on-ice results or circumstances, to be of similar ability can be fused; that is, penalty terms can be introduced to encode our prior belief that they are similar. I have chosen to fuse the Sedins in this way, with a penalty term of weight 10,000, because they are twins. I don't consider any more-distant relation than twins as legitimate grounds for this kind of prior.
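A fusion term of weight \(w\) penalizes \(w(\beta_i - \beta_j)^2\); expanding the square shows exactly where it lands in \(K\). A minimal sketch (the helper name is my own invention):

```python
import numpy as np

def add_fusion_penalty(K, i, j, weight):
    # w*(b_i - b_j)^2 = w*b_i^2 + w*b_j^2 - 2*w*b_i*b_j, so the term
    # adds +w to the two diagonal entries and -w to the off-diagonals.
    K[i, i] += weight
    K[j, j] += weight
    K[i, j] -= weight
    K[j, i] -= weight
    return K
```

The quadratic form \(\beta^T K \beta\) then charges nothing when the two fused players are estimated identically, and increasingly much as their estimates diverge.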

Coaching penalties

For the various coaching terms, I use a penalty of one hundred thousand, ten times as high as the player penalties. This is a reversion of a previous change; to my surprise out-of-sample testing shows that better predictions are obtained with a much tighter range of coaching impacts.

Non-human penalties

For the score columns, the zone columns, the rest columns, and the venue-period terms (in short, for all of the terms that measure the impact of a thing that is not a human) I use a penalty value of 0. These terms are not theoretically constrained, and the fitting in this region of the model is like ordinary-least-squares fitting. I do not expect any overfitting in these terms because they all have very large data support.

Pooling penalties

Finally, I apply a set of penalties to encode prior knowledge about how the various structural terms are distributed, so that certain sets of terms can be "pooled" properly. For example: whatever the particulars of who starts which shifts in which zones, we expect the total impact of zone starts on the entire league over a season to be zero, since every effect that helps one team should have a matching effect hurting their opponents. In previous iterations of this model I have worked around this by using one shift-start state (on-the-fly) as the "reference" state, thus having no model term, and then using indicator variables for the others (neutral zone, defensive zone, and offensive zone). However, this is a shade clumsy, requiring us to understand every term as "change from an on-the-fly shift", rather than, as I would strongly prefer, as "change from average".

Generally, if there are terms \(a_k\) in a model and we have weights \(w_k\) and would like to enforce $$ \sum_{k} w_ka_k = 0 $$ it suffices to ask that the square \(\left(\sum_k w_ka_k\right)^2\) should be small; expanding the square and interpreting the coefficient of each \(a_ia_j\) as the element \(K_{ij}\) of the penalty matrix \(K\) does the trick.

So, after forming the design matrix \(X\), but before considering the on-ice results \(Y\), we can obtain the relevant weights \(w_k\) with which to encode our prior expectation that shift-start terms should be, in aggregate, zero-sum.
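Concretely, the penalty matrix for one such zero-sum constraint is just the outer product of the weight vector with itself, scaled by the pooling strength; a sketch, with names of my own choosing:

```python
import numpy as np

def pooling_penalty(weights, strength):
    # (sum_k w_k a_k)^2 = a' (w w') a, so the penalty contribution is
    # the outer product of the weights, scaled by the pooling strength.
    w = np.asarray(weights, dtype=float)
    return strength * np.outer(w, w)
```

A vector of terms satisfying the constraint is charged nothing, while any violation is charged in proportion to the (scaled) square of the weighted sum.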

In the same way, we enforce that:


All of these pooling penalties are multiplied by an extremely strong factor (a million times larger than the 10,000 penalty for the coaches and players above). Deviation from average for players or coaches is increasingly unlikely (but not impossible) as the deviation grows; here deviations from the desired sums are contradictions in terms.

The model can be fit to any length of time; in this article I'll be describing the results of fitting it iteratively on each season from 2007-2008 through 2022-2023 in turn. For 2007-2008, we use the replacement level described above; for later seasons I use the estimate from the previous season as the prior for each column, when present, and the replacement level when not.

Computational Details

So, after all that, the thing we would like to put our impatient hands on is the vector \(\beta\) of estimates which minimizes the following expression: $$ ||Y - X\beta||_W^2 + ||\beta-\beta_0||_\Lambda^2 + ||\beta||_K^2$$

Happily, the usual methods (that is, differentiating with respect to \(\beta\) to find the unique minimum of the error expression) give a closed form for \(\beta\) as: $$ \beta = (X^TWX + \Lambda + K)^{-1}(X^TW Y + \Lambda \beta_0) $$ In effect, instead of assuming every season that the league is full of new players about whom we know nothing, we use all of the information from the last dozen years, implicitly weighted as seasons pass to prefer newer information without ever discarding old information entirely.
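The closed form translates directly into code; a minimal numpy sketch (the matrix names follow the text, everything else is my own):

```python
import numpy as np

def fit_ridge_with_prior(X, W, Y, Lam, K, beta0):
    # beta = (X'WX + Lambda + K)^{-1} (X'WY + Lambda beta0)
    A = X.T @ W @ X + Lam + K
    b = X.T @ W @ Y + Lam @ beta0
    return np.linalg.solve(A, b)
```

Solving the linear system directly, rather than forming the inverse explicitly, is the standard numerically stable choice here.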

Persnickety folks, who might wonder if there really is a unique minimum to the complicated error expression we wish to minimize, may rest assured that it suffices for both of the penalty matrices \(\Lambda\) and \(K\) to be positive semi-definite, which they are, as motivated readers may verify at their leisure. (For the etymologically and historically curious, ridge regression was invented in the first place by folks who were interested in solving problems where the matrix \(X^TWX\) was not invertible because a certain subspace of the columns of \(X\) was "too flat" in some quasi-technical sense. The artificially inserted matrix \(K\) adds a "ridge" in this notional geometry and makes the sum \(X^TWX + K\) conveniently invertible. The Bayesian "prior" interpretation, so important to my approach here, was discovered somewhat later.)