Location and Spread
Fri, Aug 23, 2024
Today, we’re going to discuss how we quantify the location and spread of numerical data.
By location I mean where the data is centered.
By spread I mean how widely the data is dispersed around that center.
As just mentioned, we measure location using either the mean or the median.
Of these two, the mean is by far the more important! Ultimately, we’ll draw conclusions from data by modeling that data using a tool called a distribution. Determining how to do that requires the mean, rather than the median.
Having said that, medians (and, more generally, percentiles) are an important part of the language of statistics and useful for understanding where a quantity lies.
The mean is a measure of where the data is centered. It is computed by simply averaging the numbers.
For example, our data might be \[2, 8, 2, 4, 7.\] The mean of the data is then \[\frac{2+8+2+4+7}{5} = \frac{23}{5} = 4.6.\]
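The same arithmetic can be checked in a couple of lines of Python. This is just a minimal sketch using the standard library's `statistics` module; the variable names are my own.

```python
from statistics import mean

data = [2, 8, 2, 4, 7]

# Average by hand: add the numbers up and divide by how many there are.
by_hand = sum(data) / len(data)

# The standard library performs the same computation.
print(by_hand, mean(data))
```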
Like the mean, the median is a measure of where the data is centered.
Roughly speaking, it represents the middle value. The way it is computed depends on how many numbers are in your list.
If the number of terms in your data is odd, then the median is simply the middle entry.
For example, if the data is \[1,3,4,8,9,\] then the median is \(4\).
If the number of terms in your data is even, then the median is simply the average of the middle two entries.
For example, if the data is \[1,3,8,9,\] then the median is \((3+8)/2 = 5.5\).
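Both cases can be written as one short function. The helper `median_by_hand` below is a hypothetical name for illustration; it is compared against `statistics.median` from the standard library on the two lists above.

```python
from statistics import median

odd_data = [1, 3, 4, 8, 9]
even_data = [1, 3, 8, 9]

def median_by_hand(values):
    """Sort, then take the middle entry (odd count) or average the middle two (even count)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_by_hand(odd_data), median(odd_data))
print(median_by_hand(even_data), median(even_data))
```

Note that the function sorts first, so it also works on data that is not already in order.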
A measure of spread that’s connected to the median is called the inter-quartile range. This is defined in terms of percentiles.
Suppose our data is \[4, 5, 9, 7, 6, 10, 2, 1, 5.\] To find percentiles, it helps to sort the data: \[1,2,4,5,5,6,7,9,10.\]
There are differing conventions on how you interpolate when a requested percentile falls between two data points, but these differences diminish as the sample size grows.
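To see two such conventions side by side, here is a small sketch using `statistics.quantiles` from Python's standard library, which offers an `"inclusive"` and an `"exclusive"` method. Which convention a given textbook or package uses is something to check rather than assume.

```python
from statistics import quantiles

data = [4, 5, 9, 7, 6, 10, 2, 1, 5]

# n=4 asks for the three cut points that split the data into quarters:
# the 25th, 50th, and 75th percentiles.
inclusive = quantiles(data, n=4, method="inclusive")
exclusive = quantiles(data, n=4, method="exclusive")

print(sorted(data))
print("inclusive:", inclusive)
print("exclusive:", exclusive)

# Inter-quartile range: 75th percentile minus 25th percentile.
print("IQR (inclusive):", inclusive[2] - inclusive[0])
print("IQR (exclusive):", exclusive[2] - exclusive[0])
```

On this data the two methods agree on the median, 5, but give quartiles of 4 and 7 (inclusive) versus 3 and 8 (exclusive). With only nine data points, the choice of convention visibly changes the inter-quartile range.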
Suppose our sample is \[1,2,3,4.\] Then, the mean is \(2.5\) and the variance is \[s^2=\frac{(-3/2)^2 + (-1/2)^2 + (1/2)^2 + (3/2)^2}{3} = \frac{5}{3}.\] The standard deviation is \[s = \sqrt{5/3} \approx 1.290994.\]
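The computation above can be reproduced step by step in Python. This is a sketch with my own variable names; the one thing to watch is that the sum of squared deviations is divided by \(n-1\) (here \(3\)), which is also what `statistics.variance` and `statistics.stdev` do.

```python
from math import sqrt
from statistics import mean, stdev, variance

sample = [1, 2, 3, 4]

xbar = mean(sample)                                # the sample mean
squared_devs = [(x - xbar) ** 2 for x in sample]   # squared distance of each point from the mean
s2 = sum(squared_devs) / (len(sample) - 1)         # divide by n - 1, not n
s = sqrt(s2)

print(s2, s)
print(variance(sample), stdev(sample))
```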
More often than not, we will be computing sample variance and the corresponding standard deviation.
For a population, we typically denote mean, variance, and standard deviation by \(\mu\), \(\sigma^2\), and \(\sigma\).
For a sample, the corresponding concepts are denoted \(\bar{x}\), \(s^2\), and \(s\).
In order to understand mean and standard deviation, it helps to see how a histogram changes as the mean and standard deviation of the underlying data change.
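One rough way to experiment with this is to generate random data with a chosen mean and standard deviation and print a crude text histogram. Everything in this sketch is my own choice for illustration: the bell-shaped random data, the three (mean, standard deviation) pairs, and the helper `text_histogram`.

```python
import random
from statistics import mean, stdev

def text_histogram(values, lo=-10, hi=10, bin_width=2, scale=50):
    """Print one row of '#' marks per bin; values outside [lo, hi) are skipped."""
    n_bins = (hi - lo) // bin_width
    counts = [0] * n_bins
    for v in values:
        if lo <= v < hi:
            index = min(int((v - lo) // bin_width), n_bins - 1)
            counts[index] += 1
    for i, c in enumerate(counts):
        left = lo + i * bin_width
        print(f"{left:>4} to {left + bin_width:>3} | {'#' * (c // scale)}")

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical (mean, standard deviation) pairs, picked only for illustration.
samples = {}
for mu, sigma in [(0, 1), (4, 1), (0, 3)]:
    samples[(mu, sigma)] = [rng.gauss(mu, sigma) for _ in range(2000)]

for (mu, sigma), values in samples.items():
    print(f"\nmean ~ {mean(values):.2f}, std ~ {stdev(values):.2f}")
    text_histogram(values)
```

Changing the mean slides the pile of `#` marks up or down the rows without changing its shape, while increasing the standard deviation spreads the same number of points over more rows.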
We really will have our first computer lab on Monday! So, let’s take a look at some code that computes these things.
We’ll use a specific, real-world data set from the Centers for Disease Control and Prevention (CDC), which publishes loads of data: the Behavioral Risk Factor Surveillance System.
This is an ongoing process where over 400,000 US adults are interviewed every year. The resulting data file has over 2000 variables, ranging from simple descriptors like age and weight, through basic behaviors like activity level and whether the subject smokes, to what kind of medical care the subject receives.
I’ve got a subset of this data on my website listing just 9 variables for a random sample of 20000 individuals: https://www.marksmath.org/data/cdc.csv
Our sample of the CDC data set is a bit more than 1 MB; it’s best to view it programmatically. Here’s how to load and view a bit of it using a Python library called Pandas:
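A minimal sketch of that step, assuming pandas is installed and that the CSV's header row uses the column names shown in the table. The helper `load_cdc` and the hand-typed `excerpt` are hypothetical, added only so the example runs without a network connection.

```python
import io
import pandas as pd

CDC_URL = "https://www.marksmath.org/data/cdc.csv"

def load_cdc(source=CDC_URL):
    """Read the CDC sample into a DataFrame; source may be a URL, a path, or a file-like object."""
    return pd.read_csv(source)

# Offline stand-in (an assumption about the file layout): five rows typed in by hand.
excerpt = io.StringIO(
    "genhlth,exerany,hlthplan,smoke100,height,weight,wtdesire,age,gender\n"
    "good,1,1,0,66,215,140,23,f\n"
    "excellent,0,1,0,73,200,185,35,m\n"
    "poor,0,1,0,65,216,150,57,f\n"
    "good,1,1,0,67,165,165,81,f\n"
    "good,1,1,1,69,170,165,83,m\n"
)

df = load_cdc(excerpt)   # swap in load_cdc() to fetch the full 20000-row file
print(df.tail())         # tail() shows the last five rows
```

Calling `load_cdc()` with no argument reads the full file from the URL. On the full file, `df.tail()` labels the last five rows 19995 through 19999, since pandas numbers rows starting from 0.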
| | genhlth | exerany | hlthplan | smoke100 | height | weight | wtdesire | age | gender |
|---|---|---|---|---|---|---|---|---|---|
| 19995 | good | 1 | 1 | 0 | 66 | 215 | 140 | 23 | f |
| 19996 | excellent | 0 | 1 | 0 | 73 | 200 | 185 | 35 | m |
| 19997 | poor | 0 | 1 | 0 | 65 | 216 | 150 | 57 | f |
| 19998 | good | 1 | 1 | 0 | 67 | 165 | 165 | 81 | f |
| 19999 | good | 1 | 1 | 1 | 69 | 170 | 165 | 83 | m |
Most of the variables (i.e., the column names) are self-explanatory. My favorite is smoke100, which is a boolean flag indicating whether or not the individual has smoked 100 cigarettes or more throughout their life. You should probably be able to classify the rest as numerical or categorical.
Pandas provides a simple way to describe numerical data. Note that this description includes most of the things we just discussed for each column.
| | exerany | hlthplan | smoke100 | height | weight | wtdesire | age |
|---|---|---|---|---|---|---|---|
| count | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.00000 | 20000.000000 | 20000.000000 |
| mean | 0.745700 | 0.873800 | 0.472050 | 67.182900 | 169.68295 | 155.093850 | 45.068250 |
| std | 0.435478 | 0.332083 | 0.499231 | 4.125954 | 40.08097 | 32.013306 | 17.192689 |
| min | 0.000000 | 0.000000 | 0.000000 | 48.000000 | 68.00000 | 68.000000 | 18.000000 |
| 25% | 0.000000 | 1.000000 | 0.000000 | 64.000000 | 140.00000 | 130.000000 | 31.000000 |
| 50% | 1.000000 | 1.000000 | 0.000000 | 67.000000 | 165.00000 | 150.000000 | 43.000000 |
| 75% | 1.000000 | 1.000000 | 1.000000 | 70.000000 | 190.00000 | 175.000000 | 57.000000 |
| max | 1.000000 | 1.000000 | 1.000000 | 93.000000 | 500.00000 | 680.000000 | 99.000000 |
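A table of this shape is what the `describe()` method of a pandas DataFrame returns. Here is a small sketch on five rows copied from the excerpt of the data shown earlier (not the full file), so the numbers are easy to check by hand and will not match the full-data summary.

```python
import pandas as pd

# Five rows copied from the excerpt of the data shown earlier (not the full file).
df = pd.DataFrame({
    "height": [66, 73, 65, 67, 69],
    "weight": [215, 200, 216, 165, 170],
    "age": [23, 35, 57, 81, 83],
    "gender": ["f", "m", "f", "f", "m"],
})

summary = df.describe()   # numerical columns only; 'gender' is left out
print(summary)

# Individual entries can be pulled out of the summary or recomputed directly.
print(summary.loc["mean", "height"], df["height"].mean())
print(summary.loc["50%", "age"], df["age"].median())
```

The `50%` row is the median, and the `std` row is the sample standard deviation, computed with the \(n-1\) divisor.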
In the previous slide, we see the following min, max, and percentiles for height, which can be visualized using a box plot.
| min | 25% | 50% | 75% | max |
|---|---|---|---|---|
| 48 | 64 | 67 | 70 | 93 |
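These five numbers determine the geometry of the box plot. The sketch below computes the inter-quartile range and, as an assumption about the plotting convention, the fences given by the common 1.5 × IQR rule for flagging outliers. A particular plotting tool may use a different rule.

```python
# Five-number summary for height, copied from the table above.
summary = {"min": 48, "q1": 64, "median": 67, "q3": 70, "max": 93}

iqr = summary["q3"] - summary["q1"]   # width of the box

# Assumption: the 1.5 * IQR rule; points beyond the fences are drawn as outliers.
lower_fence = summary["q1"] - 1.5 * iqr
upper_fence = summary["q3"] + 1.5 * iqr

print(iqr, lower_fence, upper_fence)
print(summary["min"] < lower_fence, summary["max"] > upper_fence)
```

Under that rule, both the minimum of 48 and the maximum of 93 fall outside the fences, so a box plot of height would show outliers on both ends.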