Mon, Dec 02, 2024
Last time, we learned about linear regression, which provides a model to predict future values. Today, we’ll discuss logistic regression, which provides a model to predict future probabilities.
Our first example is going to involve NCAA Basketball tournaments. For each game in a tournament, the idea is to predict the winner in a probabilistic sense. That is, we want to say something like “The probability that UNC defeats Duke is 0.7”.
You might remember the next slide, which shows my predictions for the 2022 tournament, from our first day of class.
We’re going to base these predictions, in part, on the so-called “Massey” ratings of the teams.
I’ve got a CSV file on my web space that lists every NCAA tournament team for every year from 2010 to 2023 together with that team’s Massey rating at the end of that season. Here are the top 8 teams from that data by Massey rating:
| | season | team_name | massey_rating |
|---|---|---|---|
| 361 | 2015 | Kentucky | 29.831088 |
| 618 | 2019 | Duke | 28.759867 |
| 624 | 2019 | Gonzaga | 28.646682 |
| 15 | 2010 | Kansas | 27.326564 |
| 696 | 2021 | Gonzaga | 27.291249 |
| 762 | 2022 | Gonzaga | 27.197263 |
| 671 | 2019 | Virginia | 27.038487 |
| 7 | 2010 | Duke | 26.674557 |
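In pandas, producing a table like this might look something like the following sketch; note that the file name tourney_teams.csv is a hypothetical placeholder, not the actual location of the file:

```python
import pandas as pd

# Read the team ratings (the URL here is a hypothetical placeholder)
ratings = pd.read_csv('https://www.marksmath.org/data/tourney_teams.csv')

# The top 8 tournament teams by Massey rating
ratings.sort_values('massey_rating', ascending=False).head(8)
```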
The Massey rating is constructed using a linear model to predict the score difference if two teams were to play in the near future. In 2019, for example, the Massey ratings predicted that Duke would defeat Virginia by about 1.7 points, since \(28.759867 - 27.038487 \approx 1.7\).
UNCA has been in the tournament 4 times since 2010. Their negative Massey rating last year indicated that they would be expected to lose to the average team by about a point.
| | season | team_name | massey_rating |
|---|---|---|---|
| 120 | 2011 | UNC Asheville | 0.018425 |
| 190 | 2012 | UNC Asheville | 3.278971 |
| 458 | 2016 | UNC Asheville | 2.004859 |
| 872 | 2023 | UNC Asheville | -1.107387 |
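Assuming the hypothetical ratings DataFrame sketched above, those rows could be pulled out like so:

```python
# All of UNC Asheville's tournament appearances in the data
ratings[ratings['team_name'] == 'UNC Asheville']
```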
I’ve also got a CSV file listing all NCAA tournament games from 2010 to 2023. The table below shows the last six rows and, for each row, we see

- season,
- massey_diff, which is the first Massey rating minus the second,
- seed_diff, which is the difference between the seeds (1-16), and
- label, indicating whether team 1 won the game or not.

| | season | team1_name | team1_massey_rating | team2_name | team2_massey_rating | massey_diff | seed_diff | label |
|---|---|---|---|---|---|---|---|---|
| 1733 | 2023 | San Diego St | 16.311344 | Connecticut | 21.225009 | -4.913665 | 1 | 0 |
| 1732 | 2023 | Connecticut | 21.225009 | San Diego St | 16.311344 | 4.913665 | -1 | 1 |
| 1731 | 2023 | FL Atlantic | 13.820686 | San Diego St | 16.311344 | -2.490658 | 4 | 0 |
| 1730 | 2023 | San Diego St | 16.311344 | FL Atlantic | 13.820686 | 2.490658 | -4 | 1 |
| 1729 | 2023 | Miami FL | 12.780767 | Connecticut | 21.225009 | -8.444242 | 1 | 0 |
| 1728 | 2023 | Connecticut | 21.225009 | Miami FL | 12.780767 | 8.444242 | -1 | 1 |
Here are a few more observations on the data:

- Each game appears in the table twice, with the roles of team 1 and team 2 swapped.
- In the swapped row, massey_diff and seed_diff change sign and the labels switch; thus, one row represents team 1 as the winner and the other row represents team 1 as the loser.
Let’s plot this data with the massey_diff on the horizontal axis and the label on the vertical:
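Here’s a minimal matplotlib sketch of such a plot, assuming the paired game data has already been read into a DataFrame called paired_games, as we do in the code further below:

```python
import matplotlib.pyplot as plt

# Each point is one (directed) game; a low alpha reveals overplotting
plt.scatter(paired_games['massey_diff'], paired_games['label'], alpha=0.1)
plt.xlabel('massey_diff')
plt.ylabel('label')
plt.show()
```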
Note that the symmetry arises from the two ways of looking at the games - one labeled zero on the bottom and one labeled one on the top.
Now, we’re going to “fit” that data with a certain type of curve:
Note that the curve looks just like a cumulative distribution function. Thus, if Team 1 has Massey rating \(R_1\), Team 2 has Massey rating \(R_2\), and the curve is the graph of the function \(y=f(x)\), then we ought to be able to compute the probability that Team 1 defeats Team 2 as \[f(R_1-R_2).\] That curve is called a logistic curve.
The algebraic form of the logistic curve is \[ \hat{p} = \frac{1}{1+e^{-(ax+b)}}. \] While you don’t need to worry too much about this specific algebraic form, there are a few things worth knowing. In particular, the coefficients \(a\) and \(b\) turn up in regression analyses and it is important to know how to interpret them.
It turns out that we can solve for the \(ax+b\) in that formula to get \[ \log_e\left(\frac{\hat{p}}{1-\hat{p}}\right) = ax+b. \]
The fraction \(\hat{p}/(1-\hat{p})\) is called the odds and its logarithm is called the log-odds; these terms show up throughout regression analyses.
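To make these definitions concrete, here’s a small sketch in Python showing that the log-odds transformation inverts the logistic curve:

```python
import numpy as np

def logistic(x, a, b):
    # The logistic curve: maps any real number to a probability in (0, 1)
    return 1 / (1 + np.exp(-(a * x + b)))

def log_odds(p):
    # The log-odds of a probability p
    return np.log(p / (1 - p))

# Applying log_odds to the logistic curve recovers a*x + b
a, b, x = 0.5, 1.0, 2.0
print(log_odds(logistic(x, a, b)))  # prints 2.0, which is 0.5*2.0 + 1.0
```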
Let’s take a look at some computer code to run logistic regression. We start by grabbing the paired game data and displaying the last six rows:
```python
import pandas as pd

paired_games = pd.read_csv('https://www.marksmath.org/data/paired_tourney_games.csv')
paired_games.sort_index(ascending=False).head(6)
```

| | season | team1_name | team1_massey_rating | team2_name | team2_massey_rating | massey_diff | seed_diff | label |
|---|---|---|---|---|---|---|---|---|
| 1733 | 2023 | San Diego St | 16.311344 | Connecticut | 21.225009 | -4.913665 | 1 | 0 |
| 1732 | 2023 | Connecticut | 21.225009 | San Diego St | 16.311344 | 4.913665 | -1 | 1 |
| 1731 | 2023 | FL Atlantic | 13.820686 | San Diego St | 16.311344 | -2.490658 | 4 | 0 |
| 1730 | 2023 | San Diego St | 16.311344 | FL Atlantic | 13.820686 | 2.490658 | -4 | 1 |
| 1729 | 2023 | Miami FL | 12.780767 | Connecticut | 21.225009 | -8.444242 | 1 | 0 |
| 1728 | 2023 | Connecticut | 21.225009 | Miami FL | 12.780767 | 8.444242 | -1 | 1 |
We now set up and fit a logistic regression model of the paired_games data using Python’s statsmodels library. That process looks like so:
```python
import statsmodels.api as sm

train = paired_games
Xtrain = train[['massey_diff']]
Xtrain = sm.add_constant(Xtrain)
ytrain = train[['label']]
model = sm.Logit(ytrain, Xtrain).fit()
```

```
Optimization terminated successfully.
         Current function value: 0.572040
         Iterations 6
```
Of course, the big question is how we apply the result of the regression to make a prediction. Well, the model we built has a summary method that we can use to display some information.
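Calling that method displays the full regression results; the coefficient table below is the portion we care about:

```python
# Display the fitted model's results, including the coefficient table below
model.summary()
```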
| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 1.104e-18 | 0.054 | 2.03e-17 | 1.000 | -0.106 | 0.106 |
| massey_diff | 0.1079 | 0.006 | 16.712 | 0.000 | 0.095 | 0.121 |
There’s a fair amount going on here in terms of inference. The middle three columns allow you to run hypothesis tests to determine if there’s really a relationship between the regression formula and the data; the last two determine 95% confidence intervals for the coefficients.
The most important items for making predictions are the coefficients in the first column labeled coef.
Let’s focus now on that part of the output.
In this output, massey_diff=0.1079 and const=1.104e-18 refer to the coefficients of the massey_diff variable and the constant term, which is effectively zero. Thus, we have the following formula for the log-odds:
\[\begin{aligned} O = \log_e\left(\frac{\hat{p}}{1-\hat{p}}\right) &= 0.1079\times\mathtt{massey\_diff}+1.104\times10^{-18} \\ &= 0.1079\times\mathtt{massey\_diff} \end{aligned}\]
From there, we can get the probabilistic prediction:
\[ \hat{p} = \frac{1}{1+e^{-O}}. \]
Last spring, I used something like this data through 2023 to help me with predictions for the 2024 tournament, when UConn defeated Purdue for their second straight championship. The semi-finals of that tournament featured 1 seed Purdue against 11 seed NC State, with a massey_diff of \(13.2031\).
Thus, our log-odds \(O\) satisfy \[ O = 0.1079\times13.2031 = 1.4246 \]
Thus, the predicted probability that Purdue defeats NC State would be \[ \frac{1}{1+e^{-1.4246}} \approx 0.806058. \]
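As a check, we could carry out this computation with the fitted model itself; a minimal sketch using model.params, which holds the fitted coefficients:

```python
import numpy as np

# model.params holds the fitted 'const' and 'massey_diff' coefficients
O = model.params['const'] + model.params['massey_diff'] * 13.2031
p_hat = 1 / (1 + np.exp(-O))
print(p_hat)  # roughly 0.806
```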
Sometimes, you can improve your probability computations by using more predictor variables. In this case, our log-odds looks like
\[ O = \alpha_0 + \alpha_1 X_1 + \alpha_2 X_2 +\cdots + \alpha_n X_n. \]
We still compute the probability via \[ \hat{p} = \frac{1}{1+e^{-O}}. \]
The basketball data, for example, contains not just a massey_diff variable but also the so-called seed_diff, which is the difference between the two teams’ seeds (1-16) in their regions. We can look back at the tournament slide to see what this means.
Of course, there tends to be a (negative) correlation between seed and performance, so we might expect it to help to use both of these variables.
In software output for logistic regression, each coefficient appears as its own row in the coefficient table.
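Fitting such a model is a small modification of the earlier code; a sketch, reusing the train and ytrain variables from before:

```python
# Fit a logistic model with two predictors: massey_diff and seed_diff
Xtrain2 = sm.add_constant(train[['massey_diff', 'seed_diff']])
model2 = sm.Logit(ytrain, Xtrain2).fit()
model2.summary()
```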
Let’s suppose that a logistic regression analysis taking massey_diff and seed_diff into account yields the following:
| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -1.677e-16 | 0.054 | -3.09e-15 | 1.000 | -0.106 | 0.106 |
| massey_diff | 0.1106 | 0.013 | 7.506 | 0.000 | 0.074 | 0.127 |
| seed_diff | -0.0942 | 0.018 | -0.614 | 0.539 | -0.047 | 0.025 |
This indicates that the coefficient of massey_diff should be \(0.1106\) and that the coefficient of seed_diff should be \(-0.0942\).
It also indicates that the constant is effectively zero.
Focusing again on the coefficients, let’s return to the 1 seed Purdue vs 11 seed NC State example.
The massey_diff is still \(13.2031\) and, since Purdue was the 1 seed and NC State the 11 seed, the seed_diff is \(1-11=-10\). That yields the following value for the log-odds:
\[ O = 0.1106\times13.2031 - 0.0942\times(-10) = 2.40226. \]
We then compute the probability as \[ \frac{1}{1+e^{-2.40226}} \approx 0.916999. \]
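The arithmetic is easy to check in code:

```python
import numpy as np

# Two-variable log-odds and probability for Purdue vs NC State
O = 0.1106 * 13.2031 - 0.0942 * (-10)
print(O, 1 / (1 + np.exp(-O)))  # about 2.40226 and 0.917
```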
Let’s take a quick look at the MyOpenMath HW and the associated Colab Notebook.