Measures of Data

Location and Spread

Fri, Aug 23, 2024

Today, we’re going to discuss how we quantify the location and spread of numerical data.

By location I mean

the mean or average value of the data or
the median or middle value of the data.

By spread I mean

the standard deviation of the data or
the inter-quartile range of the data.

Location

As just mentioned, we measure location using either

the mean or average value of the data or
the median or middle value of the data.

Of these two, the mean is by far the more important! Ultimately, we’ll draw conclusions from data by modeling that data using a tool called a distribution. Determining how to do that requires the mean, rather than the median.

Having said that, medians (and, more generally, percentiles) are an important part of the language of statistics and are useful for understanding where a quantity lies.

The mean

The mean is a measure of where the data is centered. It is computed by simply averaging the numbers.

For example, our data might be \[2, 8, 2, 4, 7.\] The mean of the data is then \[\frac{2+8+2+4+7}{5} = \frac{23}{5} = 4.6.\]
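
As a quick check, here is a minimal Python sketch of that computation using only the standard library. The variable names are my own choices.

```python
from statistics import mean

data = [2, 8, 2, 4, 7]

# Averaging by hand: add the numbers up and divide by how many there are
by_hand = sum(data) / len(data)

# The standard library's mean performs the same computation
lib_mean = mean(data)

print(by_hand, lib_mean)
```

Both values should agree with the 4.6 computed above.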

The median

Like the mean, the median is a measure of where the data is centered.

Roughly speaking, it represents the middle value. The way it is computed depends on how many numbers are in your list.

If the number of terms in your data is odd, then the median is simply the middle entry.

For example, if the data is \[1,3,4,8,9,\] then the median is \(4\).

If the number of terms in your data is even, then the median is simply the average of the middle two entries.

For example, if the data is \[1,3,8,9,\] then the median is \((3+8)/2 = 5.5\).
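
The odd and even cases can be sketched in a few lines of Python. The function name middle_value is hypothetical; the standard library's statistics.median follows the same rule and is shown for comparison.

```python
from statistics import median

def middle_value(values):
    """Median by the rule described above: sort, then take the middle."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd count: the middle entry
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the middle two

print(middle_value([1, 3, 4, 8, 9]), median([1, 3, 4, 8, 9]))
print(middle_value([1, 3, 8, 9]), median([1, 3, 8, 9]))
```

Note that the data is sorted first; the middle entry only means something once the list is in order.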

Connecting spread to the median

A measure of spread that’s connected to the median is called the inter-quartile range. This is defined in terms of percentiles.

Percentiles (also called quantiles)

The median is a special case of a percentile - 50% of the population lies below the median and 50% lies above.
Similarly, 25% of the population lies below the \(25^{\text{th}}\) percentile and 75% lies above.
- The \(25^{\text{th}}\) percentile is also called the first quartile.
- The \(75^{\text{th}}\) percentile is also called the third quartile.
- The second quartile is just another name for the median.
- The inter-quartile range is the difference between the third and first quartile.
One reasonable definition of an outlier is a data point that lies more than 3 inter-quartile ranges from the median.
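
Here is a rough sketch of that outlier rule in Python. The function name outliers, the made-up data, and the choice of the 'inclusive' quartile convention are all assumptions for illustration; other conventions can shift the quartiles slightly.

```python
from statistics import median, quantiles

def outliers(values, k=3):
    """Points lying more than k inter-quartile ranges from the median."""
    q1, _, q3 = quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    center = median(values)
    return [x for x in values if abs(x - center) > k * iqr]

# Made-up data for illustration: nine modest values and one extreme one
data = [1, 2, 4, 5, 5, 6, 7, 9, 10, 40]
print(outliers(data))
```

With this data, only the value 40 is far enough from the median to be flagged.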

Example

Suppose our data is \[4, 5, 9, 7, 6, 10, 2, 1, 5.\] To find percentiles, it helps to sort the data: \[1,2,4,5,5,6,7,9,10.\]

The median is definitely 5,
the \(25^{\text{th}}\) percentile might be 4,
the \(75^{\text{th}}\) percentile could be 7,
and the inter-quartile range would be 3.

There are differing conventions on how you interpolate when the number of terms doesn’t work well with the percentile, but these differences diminish with sample size.
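
To see two such conventions side by side, here is a small sketch using the quantiles function from Python's standard statistics module on the data from the example above. The two method names are the ones that function accepts; which one a given textbook or library uses is something you would need to check.

```python
from statistics import quantiles

data = [4, 5, 9, 7, 6, 10, 2, 1, 5]

# quantiles sorts the data itself and returns the three quartiles
for method in ('inclusive', 'exclusive'):
    q1, q2, q3 = quantiles(data, n=4, method=method)
    print(method, q1, q2, q3, 'IQR =', q3 - q1)
```

The 'inclusive' convention gives quartiles 4, 5, 7 and an inter-quartile range of 3, matching the values above, while 'exclusive' gives 3, 5, 8 and an inter-quartile range of 5. Both agree on the median.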

Connecting spread to the mean

Variance and standard deviation

The inter-quartile range forms a measure of the spread of a population or sample related to the median of that population or sample.
The standard deviation forms a measure of the spread of a population or sample related to the mean of the population or sample.
Roughly, the standard deviation measures how far the individuals deviate from the mean on average.
The variance is defined to be the square of the standard deviation. Thus, if the standard deviation is \(s\), then the variance is \(s^2\).

Definitions

If we have a sample of \(n\) observations \[x_1,x_2,x_3, \ldots, x_n,\] then the sample variance is defined by \[s^2 = \frac{(x_1 - \bar{x})^2 + (x_2-\bar{x})^2 +\cdots+(x_n-\bar{x})^2}{n-1}.\]
If \(s^2\) is the variance, then \(s\) is the standard deviation.

Example

Suppose our sample is \[1,2,3,4.\] Then, the mean is \(2.5\) and the variance is \[s^2=\frac{(-3/2)^2 + (-1/2)^2 + (1/2)^2 + (3/2)^2}{3} = \frac{5}{3}.\] The standard deviation is \[s = \sqrt{5/3} \approx 1.290994.\]
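
The same computation can be sketched in Python, first directly from the definition and then with the standard library's variance and stdev functions. The variable names are mine.

```python
from math import sqrt
from statistics import mean, stdev, variance

sample = [1, 2, 3, 4]
xbar = mean(sample)

# Sum of squared deviations from the mean, divided by n - 1
s2 = sum((x - xbar) ** 2 for x in sample) / (len(sample) - 1)
s = sqrt(s2)

print(s2, s)
print(variance(sample), stdev(sample))
```

Both lines should print approximately 1.6667 and 1.2910.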

Sample variance vs population variance

You might see the definition \[\sigma^2 = \frac{(x_1 - \bar{x})^2 + (x_2-\bar{x})^2 +\cdots+(x_n-\bar{x})^2}{n}.\]
The difference in the definition is the \(n\) in the denominator, rather than \(n-1\).
The difference arises because
- The definition with the \(n\) in the denominator is applied to populations and
- The definition with the \(n-1\) in the denominator is applied to samples.
To make things clear, we will sometimes refer to sample variance vs population variance.

More often than not, we will be computing sample variance and the corresponding standard deviation.
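
Python's standard statistics module provides both versions, which makes the distinction easy to check. This is just a sketch, reusing the numbers 1, 2, 3, 4 from the earlier example.

```python
from statistics import pvariance, variance

data = [1, 2, 3, 4]
n = len(data)

sample_var = variance(data)       # divides by n - 1
population_var = pvariance(data)  # divides by n

print(sample_var, population_var)

# The two differ only by the factor (n - 1) / n
print(sample_var * (n - 1) / n)
```

Since the factor (n - 1) / n approaches 1 as n grows, the two versions are nearly indistinguishable for large data sets.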

Notation

For a population, we typically denote mean, variance, and standard deviation by

Mean: \(\mu\)
Variance: \(\sigma^2\)
Standard deviation: \(\sigma\)

For a sample, the corresponding concepts are denoted

Mean: \(\bar{x}\) (sometimes written \(m\))
Variance: \(s^2\)
Standard deviation: \(s\)

Visualizing mean and standard deviation

In order to understand mean and standard deviation, it helps to see how a histogram changes as the mean and standard deviation of the underlying data change.
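
A numerical version of the same idea, as a rough sketch with a tiny made-up data set: sliding every value by the same amount moves the mean but not the standard deviation, while stretching the values away from the mean changes the standard deviation but not the mean. In a histogram, the first corresponds to the picture moving sideways and the second to it getting wider.

```python
from statistics import mean, stdev

data = [1, 2, 3, 4]
xbar = mean(data)

# Slide every value 10 units to the right
shifted = [x + 10 for x in data]

# Stretch every value to three times its distance from the mean
stretched = [xbar + 3 * (x - xbar) for x in data]

print(mean(data), stdev(data))
print(mean(shifted), stdev(shifted))      # mean moves by 10, spread unchanged
print(mean(stretched), stdev(stretched))  # mean unchanged, spread tripled
```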

Computer computations

We really will have our first computer lab on Monday! So, let’s take a look at some code that computes these things.

CDC Data

We’ll use a specific, real-world data set obtained from the Centers for Disease Control and Prevention (CDC), which publishes loads of data: the Behavioral Risk Factor Surveillance System.

This is an ongoing process where over 400,000 US adults are interviewed every year. The resulting data file has over 2000 variables ranging from simple descriptors like age and weight, through basic behaviors like activity level and whether the subject smokes to what kind of medical care the subject receives.

I’ve got a subset of this data on my website listing just 9 variables for a random sample of 20000 individuals: https://www.marksmath.org/data/cdc.csv

Loading the data

Our sample of the CDC data set is a bit more than 1 MB, so it’s best to view it programmatically. Here’s how to load and view a bit of it using a Python library called Pandas:

import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')
df.tail()

	genhlth	exerany	hlthplan	smoke100	height	weight	wtdesire	age	gender
19995	good	1	1	0	66	215	140	23	f
19996	excellent	0	1	0	73	200	185	35	m
19997	poor	0	1	0	65	216	150	57	f
19998	good	1	1	0	67	165	165	81	f
19999	good	1	1	1	69	170	165	83	m

Most of the variables (i.e., the column names) are self-explanatory. My favorite is smoke100, which is a boolean flag indicating whether or not the individual has smoked 100 cigarettes or more throughout their life. You should probably be able to classify the rest as numerical or categorical.

Describing one dimensional, numerical data

Pandas provides a simple way to describe numerical data. Note that this description includes most of the things we just discussed for each column.

df.describe()

	exerany	hlthplan	smoke100	height	weight	wtdesire	age
count	20000.000000	20000.000000	20000.000000	20000.000000	20000.00000	20000.000000	20000.000000
mean	0.745700	0.873800	0.472050	67.182900	169.68295	155.093850	45.068250
std	0.435478	0.332083	0.499231	4.125954	40.08097	32.013306	17.192689
min	0.000000	0.000000	0.000000	48.000000	68.00000	68.000000	18.000000
25%	0.000000	1.000000	0.000000	64.000000	140.00000	130.000000	31.000000
50%	1.000000	1.000000	0.000000	67.000000	165.00000	150.000000	43.000000
75%	1.000000	1.000000	1.000000	70.000000	190.00000	175.000000	57.000000
max	1.000000	1.000000	1.000000	93.000000	500.00000	680.000000	99.000000
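
Individual entries of this table can also be computed one column at a time. As a small sketch that runs without downloading anything, the five heights printed by df.tail() above are typed in by hand here; the Series name is my choice. The same methods apply to df.height on the full data.

```python
import pandas as pd

# Heights from the five rows printed by df.tail() above
height = pd.Series([66, 73, 65, 67, 69], name='height')

print(height.mean())
print(height.median())
print(height.std())        # default ddof=1: n - 1 in the denominator (sample version)
print(height.std(ddof=0))  # n in the denominator (population version)

# Inter-quartile range: third quartile minus first quartile
iqr = height.quantile(0.75) - height.quantile(0.25)
print(iqr)
```

Pandas uses the sample version of the standard deviation by default, which matches the definition with \(n-1\) in the denominator.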

Visualizing percentiles

In the previous slide, we see the following min, max and percentiles, which can be visualized using a box plot.

min	25%	50%	75%	max
48	64	67	70	93

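import matplotlib.pyplot as plt
# whis=10 lets the whiskers extend up to 10 inter-quartile ranges past the box,
# so here they reach all the way to the min and max
df.boxplot('height', vert=False, grid=False, whis=10)
plt.show()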

Generating a histogram

df.hist('height', bins=20, grid=False, edgecolor='black')
m = df.height.mean()
# Dashed vertical line at the mean; 5400 is a hand-picked height for the line
plt.plot([m, m], [0, 5400], 'y--')
plt.show()