Prediction Model Cordelia

September 25, 2016, Micah Blake McCurdy, @IneffectiveMath

My single-game prediction model for 2016-2017 is called Cordelia. It replaces last year's model, called Oscar.

What is included:

The items in bold are new with Cordelia, the others were all in Oscar. The goaltending terms in Cordelia are computed in a different way to Oscar, as I will explain. Some of the explanation here is copied from the above-linked description of Oscar, where appropriate.

What is not included.

I considered several possible features that have not been included, because (in the presence of the things which are included) they did not improve the predictivity of the model on the training data. Several other things are not included that I could presumably test but have not:

I don't expect to include any of these terms in future models.

Features and Output

Given a single game, Cordelia is a logistic regression model that estimates the probability with which the home team will win the game based on recent measurements of the skill of the two teams. It favours one team, or the other, but it does not predict the winner of games; that is, it is not designed for classification, for which a logistic regression is not a suitable model type. It was trained on all regular-season games from 2009-2016.

Cordelia has nineteen features, or inputs. One input is categorical (either you have home ice, or ya don't) and the other eighteen are continuous; these inputs are normalized, that is, each measurement is expressed as a number of standard deviations above or below the mean of that variable, as measured in the training data.
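The normalization described above can be sketched in a few lines. This is only an illustration: the function name `standardize` is hypothetical, and the example numbers are the mean and standard deviation listed for home 5v5 unblocked shot generation.

```python
def standardize(value, mean, sd):
    """Express a raw measurement as a number of standard deviations
    above (positive) or below (negative) the training-data mean."""
    return (value - mean) / sd


# A team generating 42.45 unblocked shots per sixty, against a training
# mean of 40.5 and a standard deviation of 3.90, sits half a standard
# deviation above average.
z = standardize(42.45, 40.5, 3.90)
```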

All of the continuous features come in pairs, one for the home team, one for the away team. The input data is taken from all of each team's past games, both home and away, so a "Home team shooting percentage" term means "How important your shooting percentage is, when you are going to be the home team" and not "How good your shooting percentage was when you were the home team".

5v5 Shot measures

| Measure | Model Coefficient (Beta) | Mean | Standard Deviation |
| --- | --- | --- | --- |
| 5v5 Unblocked Shot Generation, Home Team | +0.088 | 40.5 | 3.90 |
| 5v5 Unblocked Shot Generation, Away Team | -0.156 | 40.5 | 3.95 |
| 5v5 Unblocked Shot Suppression, Home Team | -0.116 | 40.6 | 4.01 |
| 5v5 Unblocked Shot Suppression, Away Team | +0.089 | 40.5 | 4.06 |

In training, this was measured as the score-and-venue-adjusted unblocked shots per sixty minutes of 5v5 play during the previous 25 games, or as many games as were available early in the season. Considering larger samples of games weakened the fit and predictivity of the model, as did adjusting the shot numbers for strength-of-schedule.
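A rough sketch of the windowing might look like the following. It assumes that each game is stored as a pair of (adjusted unblocked shots, 5v5 minutes), that the score-and-venue adjustment has already been applied, and that shots and minutes are pooled across the window rather than averaging per-game rates; the function name and data layout are hypothetical.

```python
def recent_rate_per_60(games, window=25):
    """Shots per sixty minutes over the most recent `window` games.

    `games` is a chronological list of (shots, minutes) pairs. Early in
    the season, when fewer than `window` games exist, all of them are used.
    """
    recent = games[-window:]
    shots = sum(s for s, _ in recent)
    minutes = sum(m for _, m in recent)
    if minutes == 0:
        raise ValueError("no 5v5 time available")
    return 60.0 * shots / minutes


# Thirty identical games of 30 shots in 45 minutes of 5v5 play:
# only the last 25 are used, and the rate is 40 per sixty.
rate = recent_rate_per_60([(30.0, 45.0)] * 30)
```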

The mean value over 2009-2015 for home unblocked shot generation is 40.5 per sixty minutes of 5v5 hockey, with a standard deviation of 3.9 unblocked shots per sixty minutes. A team whose recent results are 44.4 unblocked shots per sixty minutes is one standard deviation above average, and thus their home generation term is +0.088. This is positive, which means they are more likely to win the game under consideration, which makes sense -- they generate more offence than a typical team. The shot generation of the away team has the opposite sign, as we expect, and is larger in magnitude. That is, the ability of the away team to generate unblocked shots has a larger impact on the result of the game than the same ability in the home team. The ability of the home team to suppress unblocked shots is also important, with the second-largest coefficient of the four.
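To make the arithmetic concrete, here is a small sketch of how a single shot term could be computed from a team's recent rate. The coefficients, means and standard deviations are copied from the table above; the dictionary layout and the name `shot_term` are hypothetical.

```python
# (beta, mean, standard deviation) for each 5v5 shot term.
SHOT_TERMS = {
    "home_generation": (+0.088, 40.5, 3.90),
    "away_generation": (-0.156, 40.5, 3.95),
    "home_suppression": (-0.116, 40.6, 4.01),
    "away_suppression": (+0.089, 40.5, 4.06),
}


def shot_term(name, per60):
    """Contribution of one shot measure to the home team's log-odds."""
    beta, mean, sd = SHOT_TERMS[name]
    return beta * (per60 - mean) / sd


# A home team exactly one standard deviation above the mean
# contributes the coefficient itself.
home = shot_term("home_generation", 40.5 + 3.90)

# The same one-standard-deviation edge for the away team counts
# against the home team, and by a larger amount.
away = shot_term("away_generation", 40.5 + 3.95)
```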

For the away team, generation is more important. For the home team, suppression is more important. This suggests that the ability of the home team to set a defence and the away team to break it down (or fail to) is more important than the converse.

Oscar and Pip (my previous models) had the above features, with broadly the same patterns.

I am routinely asked why I chose to use unblocked shots instead of all shots. In my training data, both predict winning equally well; I choose to use unblocked shots because it permits me to use blocked shots in some future model, since blocked shots are not confounded with unblocked shots. Obviously blocked shots are confounded with all shots since the former is a subset of the latter.

5v5 Goaltending

Cordelia has two goaltending terms, one for the home goalie and one for the away goalie.

| Measure | Model Coefficient (Beta) | Mean | Standard Deviation |
| --- | --- | --- | --- |
| 5v5 goals per shot-on-goal, Home team | -0.071 | 0.0771 | 0.00605 |
| 5v5 goals per shot-on-goal, Away team | +0.053 | 0.0776 | 0.00604 |

For a given goaltender, we measure their goals allowed per shot-on-goal at 5v5 during the regular season in the past two years. Although this number is regressed by the associated model coefficient, we alter the covariate value itself for goalies whose recent body of work is not very substantial. If a goaltender has faced two thousand shots-on-goal, we use their covariate as is; if not, we artificially add shots-on-goal and goals at league average rates (about 7.7%) until we have two thousand shots-on-goal.
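The padding step can be sketched as follows. The function name is hypothetical, and `LEAGUE_RATE` is a round placeholder chosen to keep the arithmetic easy to follow; substitute the league-average rate quoted in the text.

```python
LEAGUE_RATE = 0.08  # placeholder only, not the model's value


def padded_goals_per_shot(goals, shots, league_rate=LEAGUE_RATE,
                          min_shots=2000):
    """Goals allowed per shot-on-goal, padded with league-average
    shots until the goalie has faced `min_shots`."""
    if shots >= min_shots:
        return goals / shots
    missing = min_shots - shots
    return (goals + league_rate * missing) / min_shots


# A goalie with 30 goals allowed on 500 shots (6.0%) is pulled most of
# the way toward the league rate: (30 + 0.08 * 1500) / 2000 = 7.5%.
small_sample = padded_goals_per_shot(30, 500)

# A goalie with no recent shots at all looks exactly league average.
rookie = padded_goals_per_shot(0, 0)
```

With this scheme the amount of pull toward league average shrinks smoothly as the goalie's own sample grows, and vanishes at two thousand shots.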

Goalies at the very beginning of their careers will appear to be very close to league average, as will goalies with a long career but for whom serious injury or perhaps a stint in another league has denied us a body of fresh data. In this way I rely on the judgment of the coaches who select players; if a player is dressing for an NHL game I expect that there is a good reason to believe that the player is "NHL calibre", that is, within some kind of nodding acquaintance of league average performance.

Numerically, we see that the quality of the home goalie is somewhat more important than that of the away goalie. Furthermore, having a goalie one standard deviation better than average has a smaller impact on winning than having one standard deviation better-than-average unblocked shot generation or suppression. The coefficient for the home term is negative and vice-versa because better goalies have lower numbers of goals per shot-on-goal.

Team scoring percentage

Cordelia has two terms for capturing shooting efficiency.

| Measure | Model Coefficient (Beta) | Mean | Standard Deviation |
| --- | --- | --- | --- |
| 5v5 Goals per Shot-on-goal, Home team | +0.042 | 0.0779 | 0.017 |
| 5v5 Goals per Shot-on-goal, Away team | -0.071 | 0.0780 | 0.017 |

The shooting talent for the teams is measured over the past 25 games, like the generation and suppression terms. Where the goalie terms were fairly similar in magnitude, the shooting term for the away team is almost double the magnitude of that of the home term. This is consistent with our finding above that the offensive talent of the away team is more important than that of the home team.

Special Teams Terms

Cordelia has four terms for special teams:

| Measure | Model Coefficient (Beta) | Mean | Standard Deviation |
| --- | --- | --- | --- |
| 5v4 Shot Generation, Home Team | +0.045 | 95.3 | 12.9 |
| 5v4 Shot Generation, Away Team | -0.032 | 95.3 | 13.3 |
| 4v5 Shot Suppression, Home Team | -0.007 | 95.3 | 12.0 |
| 4v5 Shot Suppression, Away Team | +0.026 | 95.5 | 12.2 |

Here we have widened our view from unblocked shots to all shots, simply because they are more predictive; as above we measure the shots per sixty minutes over the last 25 games. Having a power-play which can generate shots at one standard deviation above league average has about as much impact as having a goaltender one standard deviation above average. A slavish adherence to the Akaike Information Criterion would not have permitted the penalty killing terms to survive, but I have decided to include them for reasons of symmetry and pedagogy.

Penalty Terms

Cordelia has four terms for penalties:

| Measure | Model Coefficient (Beta) | Mean | Standard Deviation |
| --- | --- | --- | --- |
| Penalties Drawn, Home Team | +0.043 | 3.3 | 0.62 |
| Penalties Taken, Home Team | +0.007 | 3.3 | 0.67 |
| Penalties Drawn, Away Team | +0.007 | 3.3 | 0.63 |
| Penalties Taken, Away Team | -0.005 | 3.3 | 0.64 |

We only consider non-offsetting minor penalties drawn or taken in all situations. Coincidental minors (which don't change skater numbers) are not included; neither are major penalties or misconducts of any length, offsetting or not. Surprisingly, the home team's history of drawing (or failing to draw) penalties is tremendously more important than any of the other penalty terms, which is consistent with what others have written about persistent home advantages in refereeing decisions. In fact, all of the other penalty terms would be removed from the model if parsimony were higher than symmetry on my list of concerns.

Rest

Cordelia has two rest terms:

| Measure | Model Coefficient (Beta) | Mean | Standard Deviation |
| --- | --- | --- | --- |
| Days of Rest, Home team | +0.0092 | 1.2 | 0.62 |
| Days of Rest, Away team | -0.0655 | 1.0 | 0.70 |

Again surprisingly, rest is not symmetric---being well rested (or not) is much more important to the road team than the home team, although home teams are generally about twenty percent more rested than road teams.

Home-ice Advantage

Finally, Cordelia includes a home-ice advantage term of 0.182; this corresponds to a historical advantage of 54.5%.
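The conversion from the home-ice term to a percentage is the usual logistic function. The sketch below assumes that the home-ice term acts as the intercept and that the other terms are simply added to it on the log-odds scale, as is standard for a logistic regression; the function name is hypothetical.

```python
import math


def win_probability(log_odds):
    """Logistic function: convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


HOME_ICE = 0.182

# Two teams that are average in every respect: only home ice remains.
baseline = win_probability(HOME_ICE)  # about 0.545

# Adding a hypothetical +0.088 from the other terms shifts the estimate up.
favoured = win_probability(HOME_ICE + 0.088)
```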

Training details

Cordelia was trained on all regular season games from 2009-2016. My database includes the previous two years also, but I decided to exclude them since all of the goalies in those years artificially appear to have similar talent. I used a logistic regression, with regulation and overtime wins coded as having win probability 1, regulation and overtime losses as having win probability 0, and shootouts as having win probability 0.5 regardless of who won them.
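The outcome coding can be written down directly. The fitting procedure itself is not described here, so the loss function below is only one common way to let a fractional target such as 0.5 enter a logistic regression (the cross-entropy, or log loss); both function names are hypothetical.

```python
import math


def outcome_target(home_won, shootout):
    """Training target from the home team's point of view."""
    if shootout:
        return 0.5  # shootouts count as a coin flip, whoever won
    return 1.0 if home_won else 0.0


def cross_entropy(target, predicted):
    """Log loss for a single game; well defined for fractional targets.

    `predicted` must lie strictly between 0 and 1.
    """
    return -(target * math.log(predicted)
             + (1.0 - target) * math.log(1.0 - predicted))


# A shootout game is best "explained" by a 50% prediction:
# the loss at 0.5 is lower than at any other predicted value.
at_half = cross_entropy(outcome_target(True, True), 0.5)
at_sixty = cross_entropy(outcome_target(True, True), 0.6)
```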