2.3: Measures of Spread
The location of the center of a data set is important, but so is how much variability, or spread, there is in the data. If a teacher gives an exam and tells you that the mean score was 75%, that might make you happy. But if the teacher then says that the spread was only 2%, most people had grades around 75%, so you most likely have a C on the exam. If instead you are told that the spread was 15%, then there is a chance that you have an A on the exam. Of course, there is also a chance that you have an F. So a higher spread may be good or it may be bad. Without that information, however, you have only part of the picture of the exam scores, so figuring out the spread or variability is useful.
Measures of Spread or Variability: These values describe how spread out a data set is.
There are different ways to calculate a measure of spread. One is called the range and another is called the standard deviation. Let’s look at the range first.
Range: To find the range, subtract the minimum data value from the maximum data value. Some people give the range by just listing the minimum data value and the maximum data value. However, to statisticians the range is a single number. So you want to actually calculate the difference.
Range = maximum - minimum
The range is relatively easy to calculate, which is good. However, because of this simplicity it does not tell the entire story. Two data sets can have the same range, but one can have much more variability in the data while the other has much less.
Example \(\PageIndex{1}\): Finding the Range
Find the range for each data set.
- 10, 20, 30, 40, 50
Range = 50 - 10 = 40
- 10, 35, 36, 37, 50
Range = 50 - 10 = 40
Notice that both data sets in Example \(\PageIndex{1}\) have the same range. However, the second data set (part b) has most of its values close together, except for the extremes. There seems to be less variability in the second data set than in the first (part a). So we need a better way to quantify the spread.
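The range calculation in Example \(\PageIndex{1}\) can be sketched in a few lines of Python. The function name `data_range` is only an illustrative choice, not something defined in the text.

```python
def data_range(values):
    """Return the range: maximum data value minus minimum data value."""
    return max(values) - min(values)

set_a = [10, 20, 30, 40, 50]   # first data set in the example
set_b = [10, 35, 36, 37, 50]   # second data set in the example

print(data_range(set_a))  # 40
print(data_range(set_b))  # 40
```

Both calls give the same number even though the middle values of the two lists are arranged very differently, which is exactly the limitation described above.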
Instead of looking at the difference between highest and lowest, let’s look at the difference between each data value and the center. The center we will use is the mean. The difference between the data value and the mean is called the deviation.
Deviation from the Mean: data value - mean = \( x - \overline{x}\)
To see how this works, let’s use the 11 high temperatures from Flagstaff, AZ (the same data set that appears in Example \(\PageIndex{4}\) below). The mean is about 62.7°F.
| \(x\) | \( x - \overline{x}\) |
| 71 | 8.3 |
| 59 | -3.7 |
| 69 | 6.3 |
| 68 | 5.3 |
| 63 | 0.3 |
| 57 | -5.7 |
| 57 | -5.7 |
| 57 | -5.7 |
| 57 | -5.7 |
| 65 | 2.3 |
| 67 | 4.3 |
| Sum | 0.3 |
Notice that the sum of the deviations is around zero. If there is no rounding of the mean, then this should add up to exactly zero. So what does that mean? Does this imply that on average the data values are zero distance from the mean? No. It just means that some of the data values are above the mean and some are below the mean. The negative deviations are for data values that are below the mean and the positive deviations are for data values that are above the mean. So we need to get rid of the sign (positive or negative). How do we get rid of a negative sign? Squaring a number is a widely accepted way to make all of the numbers positive. So let’s square all of the deviations.
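As a quick check of the claim about rounding, the sketch below sums the deviations twice: once with the mean rounded to 62.7, as in the table, and once with the unrounded mean. The variable names are illustrative assumptions.

```python
temps = [71, 59, 69, 68, 63, 57, 57, 57, 57, 65, 67]

exact_mean = sum(temps) / len(temps)   # 690 / 11, not rounded
rounded_mean = round(exact_mean, 1)    # 62.7, the value used in the table

# Sum of deviations using the rounded mean (matches the table's Sum row)
sum_dev_rounded = sum(x - rounded_mean for x in temps)

# Sum of deviations using the unrounded mean
sum_dev_exact = sum(x - exact_mean for x in temps)

print(round(sum_dev_rounded, 1))   # 0.3
print(abs(sum_dev_exact) < 1e-9)   # True: zero up to floating-point error
```

The second result is compared against a small tolerance rather than to exactly zero because computer arithmetic with decimals carries tiny rounding errors of its own.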
Squared Deviations from the Mean: To find these values, square the deviations from the mean. Also, you can think of this as being the squared distance from the mean.
So for the data set, let’s find the squared deviations.
| \( x \) | \( x - \overline{x}\) | \( (x - \overline{x})^{2}\) |
| 71 | 8.3 | 68.89 |
| 59 | -3.7 | 13.69 |
| 69 | 6.3 | 39.69 |
| 68 | 5.3 | 28.09 |
| 63 | 0.3 | 0.09 |
| 57 | -5.7 | 32.49 |
| 57 | -5.7 | 32.49 |
| 57 | -5.7 | 32.49 |
| 57 | -5.7 | 32.49 |
| 65 | 2.3 | 5.29 |
| 67 | 4.3 | 18.49 |
| Sum | 0.3 | 304.19 |
Now that we have the sum of the squared deviations, we should find the mean of these values. However, since this is a sample, the normal way to find the mean, summing and dividing by \(n\), does not estimate the true population value correctly. It would underestimate the true value. So, to calculate a better estimate, we will divide by a slightly smaller number, \(n-1\). This strange average is known as the sample variance.
Sample Variance: This is the sum of the squared deviations from the mean divided by \(n-1\). The symbol for sample variance is \(s^2\) and the formula for the sample variance is:
\(s^2 = \dfrac{\sum (x - \overline{x})^2 }{n-1}\)
For this data set, the sample variance is
\(s^2 = \dfrac{304.19}{11-1} = \dfrac{304.19}{10} = 30.419\)
The variance measures the average squared distance from the mean. Since we want to know the average distance from the mean, we will need to take the square root at this point.
Sample Standard Deviation: This is the square root of the variance. The standard deviation is a measure of the average distance the data values are from the mean. The symbol for sample standard deviation is \(s\) and the formula for the sample standard deviation is
\(s = \sqrt{s^2} = \sqrt{\dfrac{\sum (x - \overline{x})^2 }{n-1}}\)
Thus, for this data set, the sample standard deviation is \(s = \sqrt{30.419} \approx 5.52 ^{\circ}F\).
Note: The units are the same as the original data.
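The table-based calculation above can be reproduced step by step. This is a minimal sketch that follows the text in using the rounded mean 62.7; the variable names are illustrative.

```python
import math

temps = [71, 59, 69, 68, 63, 57, 57, 57, 57, 65, 67]
mean = 62.7  # rounded mean, as used in the tables above

# Squared deviations from the mean (third column of the table)
squared_devs = [(x - mean) ** 2 for x in temps]
sum_sq = sum(squared_devs)

n = len(temps)
sample_variance = sum_sq / (n - 1)       # divide by n - 1, not n
sample_sd = math.sqrt(sample_variance)   # back to the original units

print(round(sum_sq, 2))           # 304.19
print(round(sample_variance, 3))  # 30.419
print(round(sample_sd, 2))        # 5.52
```

Using the unrounded mean instead would change the later decimal places slightly, which is why a calculator reports a value such as 5.515 rather than 5.5154.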
Since the sample variance and the sample standard deviation are used to estimate the population variance and population standard deviation, we should define the symbols and formulas for those as well.
| Population Variance: \(\sigma^2= \dfrac{\sum (x - \mu)^2 }{N}\) |
| Population Standard Deviation: \(\sigma = \sqrt{\sigma ^2} = \sqrt{\dfrac{\sum (x - \mu)^2 }{N}}\) |
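To see how the choice of divisor changes the result, the sketch below implements both formulas by hand and applies them to the same small list. The data values are made up for illustration, and the function names are assumptions rather than anything defined in the text.

```python
import math

def sample_variance(values):
    """Sum of squared deviations divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def population_variance(values):
    """Sum of squared deviations divided by N."""
    big_n = len(values)
    mu = sum(values) / big_n
    return sum((x - mu) ** 2 for x in values) / big_n

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values, mean is 5

print(population_variance(data))             # 4.0
print(math.sqrt(population_variance(data)))  # 2.0
print(round(sample_variance(data), 4))       # 4.5714
```

Python's standard `statistics` module provides `variance` and `pvariance` (and `stdev` and `pstdev`) for the same two calculations, so the hand-written functions here are only meant to mirror the formulas.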
Example \(\PageIndex{2}\): Finding the Range, Variance, and Standard Deviation
A random sample of unemployment rates (in percent) for 10 countries in the EU for March 2013 is given:
| 11.0 | 7.2 | 13.1 | 26.7 | 5.7 | 9.9 | 11.5 | 8.1 | 4.7 | 14.5 |
(Eurostat, n.d.)
Find the range, variance, and standard deviation.
Since this is a sample, then we will use the sample statistics formulas.
In Example \(\PageIndex{3}\), we calculated the mean to be 11.24%. The maximum value is 26.7% and the minimum value is 4.7%. So the range is:
range = 26.7 – 4.7 = 22.0%
To find the variance and the standard deviation, it is easier to use a table than the formula. The table follows the formula, though, so they are the same thing.
| x | \(x - \overline{x}\) | \((x - \overline{x})^2\) |
| 11.0 | -0.24 | 0.0576 |
| 7.2 | -4.04 | 16.3216 |
| 13.1 | 1.86 | 3.4596 |
| 26.7 | 15.46 | 239.0116 |
| 5.7 | -5.54 | 30.6916 |
| 9.9 | -1.34 | 1.7956 |
| 11.5 | 0.26 | 0.0676 |
| 8.1 | -3.14 | 9.8596 |
| 4.7 | -6.54 | 42.7716 |
| 14.5 | 3.26 | 10.6276 |
| Sum | 0 | 354.664 |
Sample variance:
\(s^2 = \dfrac{354.664}{10-1} = \dfrac{354.664}{9} \approx 39.40711111\)
Sample standard deviation:
\(s = \sqrt{39.4071111} \approx 6.28 \%\)
So, the unemployment rates for countries in the EU average approximately 11.24%, with a typical spread of about 6.28%. Since the sample standard deviation is fairly high compared to the mean, there is a great deal of variability in unemployment rates for countries in the EU. This means that some countries in the EU have rates that are much lower than the mean and some have rates that are much higher than the mean.
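The results of this example can be checked with Python's standard `statistics` module, whose `variance` and `stdev` functions use the \(n-1\) divisor. This is only a verification sketch; the variable names are illustrative.

```python
import statistics

rates = [11.0, 7.2, 13.1, 26.7, 5.7, 9.9, 11.5, 8.1, 4.7, 14.5]

mean = statistics.mean(rates)
value_range = max(rates) - min(rates)
s2 = statistics.variance(rates)  # sample variance (divides by n - 1)
s = statistics.stdev(rates)      # sample standard deviation

print(round(mean, 2))         # 11.24
print(round(value_range, 1))  # 22.0
print(round(s2, 2))           # 39.41
print(round(s, 2))            # 6.28
```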
Percentiles
There are other calculations that we can do to look at spread. One of those is called percentile. This looks at what data value has a certain percent of the data at or below it.
Percentile: The kth percentile is a value with k percent of the data at or below it.
For example, if a data value is in the 80th percentile, then 80% of the data values fall at or below this value.
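The "at or below" idea can be sketched as a small function that reports what percent of a data set is less than or equal to a given value. The function name and the exam scores are made up for illustration.

```python
def percent_at_or_below(values, value):
    """Percent of the data values that are at or below `value`."""
    count = sum(1 for x in values if x <= value)
    return 100 * count / len(values)

scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]  # made-up exam scores

print(percent_at_or_below(scores, 90))  # 80.0
```

Here 8 of the 10 scores are at or below 90, so under this definition a score of 90 sits at the 80th percentile of this list.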
We see percentiles in many places in our lives. If you take a standardized test, your score is often given as a percentile. If you take your child to the doctor, their height and weight are given as percentiles. If your child is tested for giftedness or behavior problems, the score is given as a percentile. If your child has a score on a gifted test that is in the 92nd percentile, then 92% of all of the children who took the same test scored the same as or lower than your child. That also means that 8% scored the same as or higher than your child. This may mean that your child is gifted.
Example \(\PageIndex{3}\): Interpreting Percentiles
Suppose you took the SAT mathematics test and received your score as a percentile.
- What does a score in the 90th percentile mean?
90 percent of the scores were at or below your score (You did the same as or better than 90% of the test takers.)
- What does a score in the 70th percentile mean?
70% of the scores were at or below your score.
- If the test was out of 800 points and you scored in the 80th percentile, what was your score on the test?
You do not know! All you know is that you scored the same as or better than 80% of the people who took the test. If all the scores were really low, you could have still failed the test. On the other hand, if many of the scores were high you could have gotten a 95% on the test.
- If your score was in the 95th percentile, does that mean you passed the test?
No, it just means you did the same as or better than 95% of the other people who took the test. You could have failed the test, but still did the same as or better than 95% of the rest of the people.
There are three percentiles that are commonly used. They are the first, second, and third quartiles, where the quartiles divide the data into 25% sections.
First Quartile (Q1): 25th percentile (25% of the data falls at or below this value.)
Second Quartile (Q2 or M): 50th percentile, also known as the median (50% of the data falls at or below this value.)
Third Quartile (Q3): 75th percentile (75% of the data falls at or below this value.)
To find the quartiles of a data set:
Step 1: Sort the data set from the smallest value to the largest value.
Step 2: Find the median (M or Q2).
Step 3: Find the median of the lower 50% of the data values. This is the first quartile (Q1).
Step 4: Find the median of the upper 50% of the data values. This is the third quartile (Q3).
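The four steps above can be sketched as a short function. It assumes the convention used in the worked examples of this section: when the number of data values is odd, the median itself is left out of both halves. The function name and the sample list are illustrative only.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves steps."""
    data = sorted(values)              # Step 1: sort
    n = len(data)
    half = n // 2
    q2 = statistics.median(data)       # Step 2: median of all data
    lower = data[:half]                # lower 50% of the data
    # For odd n, skip the middle value; for even n, take the top half as is
    upper = data[half + 1:] if n % 2 else data[half:]
    q1 = statistics.median(lower)      # Step 3: median of lower half
    q3 = statistics.median(upper)      # Step 4: median of upper half
    return q1, q2, q3

print(quartiles([13, 1, 7, 3, 11, 5, 9]))  # (3, 7, 11)
```

Other textbooks and software packages use slightly different rules for locating quartiles, so results from other tools may not match this method exactly.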
If we put the three quartiles together with the maximum and minimum values, then we have five numbers that describe the data set. This is called the five-number summary.
Five-Number Summary: Lowest data value known as the minimum (Min), the first quartile (Q1), the median (M or Q2), the third quartile (Q3), and the highest data value known as the maximum (Max).
Also, since we have the quartiles, we can talk about how much spread there is between the 1st and 3rd quartiles. This is known as the interquartile range.
Interquartile Range (IQR):
IQR = Q3 – Q1
There are times when we want to look at the five-number summary in a graphical representation. This is known as a box-and-whiskers plot or a box plot.
Box Plot: Plot of the five-number summary
A box plot is created by first setting a scale (number line) as a guideline for the box plot. Then, draw a rectangle that spans from Q1 to Q3 above the number line. Mark the median with a vertical line through the rectangle. Next, draw dots for the minimum and maximum points to the sides of the rectangle. Finally, draw lines from the sides of the rectangle out to the dots.
Example \(\PageIndex{4}\): Find the Five-Number Summary and IQR and Draw a Box Plot (Odd Number of Data Points)
The first 11 days of May 2013 in Flagstaff, AZ, had the following high temperatures (in °F):
| 71 | 59 | 69 | 68 | 63 | 57 |
| 57 | 57 | 57 | 65 | 67 |
(Weather Underground, n.d.)
Find the five-number summary and IQR and draw a box plot.
To find the five-number summary, you must first put the numbers in order from smallest to largest.
57, 57, 57, 57, 59, 63, 65, 67, 68, 69, 71
Then find the median. The number 63 is in the middle of the data set, so the median is 63°F. To find Q1, look at the numbers below the median. Since 63 is the median, you do not include that in the listing of the numbers below the median. To find Q3, look at the numbers above the median. Since 63 is the median, you do not include that in the listing of the numbers above the median.
Looking at the numbers below the median, the median of those is 57. Q1 = 57°F. Looking at the numbers above the median, the median of those is 68. Q3 = 68°F.
Now find the minimum and maximum. The minimum is 57°F and the maximum is 71°F. Thus, the five-number summary is:
Min = 57°F
Q1 = 57°F
Med = Q2 = 63°F
Q3 = 68°F
Max = 71°F.
Also, the IQR = Q3 – Q1 = 68 – 57 = 11°F
Finally, draw a box plot for this data set as follows:
Figure \(\PageIndex{7}\): Box Plot
Temperatures in °F in Flagstaff, AZ, in early May 2013
Notice that the median is basically in the center of the box, which implies that the data is not skewed. However, the minimum value is the same as Q1, so that implies there might be a little skewing, though not much.
Example \(\PageIndex{5}\): Find the Five-Number Summary and IQR and Draw a Box Plot (Even Number of Data Points)
The first 12 days of May 2013 in Flagstaff, AZ, had the following high temperatures (in °F):
| 71 | 59 | 69 | 68 | 63 | 57 |
| 57 | 57 | 57 | 65 | 67 | 73 |
(Weather Underground, n.d.)
Find the five-number summary and IQR and draw a box plot.
To find the five-number summary, you must first put the data values in order from smallest to largest. 57, 57, 57, 57, 59, 63, 65, 67, 68, 69, 71, 73
Then find the median. The numbers 63 and 65 are in the middle of the data set, so the median is \(\dfrac{63+65}{2} = 64 ^{\circ}F\).
To find Q1, look at the numbers below the median. Since the number 64 is the median, you include all the numbers below 64, including the 63 that you used to find the median.
To find Q3, look at the numbers above the median. Since the number 64 is the median, you include all the numbers above 64, including the 65 that you used to find the median.
Looking at the numbers below the median (57, 57, 57, 57, 59, 63), the median of those is \(\dfrac{57+57}{2} = 57 ^{\circ}F\). Q1 = 57°F. Looking at the numbers above the median (65, 67, 68, 69, 71, 73), the median of those is \(\dfrac{68+69}{2} = 68.5 ^{\circ}F\). Q3 = 68.5°F.
Now find the minimum and maximum. The minimum is 57°F and the maximum is 73°F.
Thus, the five-number summary is:
Min = 57°F
Q1 = 57°F
Med = Q2 = 64°F
Q3 = 68.5°F
Max = 73°F.
Also, the IQR = Q3 – Q1 = 68.5 – 57 = 11.5°F
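The five-number summaries and IQRs from Examples \(\PageIndex{4}\) and \(\PageIndex{5}\) can be checked with a short sketch. The function name `five_number_summary` is an assumption, and the quartile rule is the median-of-halves method described in this section.

```python
import statistics

def five_number_summary(values):
    """Return (Min, Q1, Median, Q3, Max) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # Odd n: exclude the median from the upper half; even n: keep all of it
    upper = data[half + 1:] if n % 2 else data[half:]
    return (data[0], statistics.median(lower), statistics.median(data),
            statistics.median(upper), data[-1])

temps_11 = [71, 59, 69, 68, 63, 57, 57, 57, 57, 65, 67]  # first 11 days
temps_12 = temps_11 + [73]                                # first 12 days

for temps in (temps_11, temps_12):
    low, q1, med, q3, high = five_number_summary(temps)
    print((low, q1, med, q3, high), "IQR =", q3 - q1)
```

For the 11-day data this gives Min 57, Q1 57, median 63, Q3 68, Max 71, and IQR 11. For the 12-day data it gives Min 57, Q1 57, median 64, Q3 68.5, Max 73, and IQR 11.5, matching the hand calculations.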
Finally, draw a box plot for this data set as follows:
Figure \(\PageIndex{10}\): Box Plot
Temperatures in °F in Flagstaff, AZ, in early May 2013
Notice that the median is basically in the center of the box, so that implies that the data is not skewed. However, the minimum value is the same as Q1, so that implies there might be a little skewing, though not much.
It is important to understand how to find all descriptive statistics by hand and also by using a calculator.
Example \(\PageIndex{6}\): Finding the Descriptive Statistics Using the TI-83/84 Calculator
The first 11 days of May 2013 in Flagstaff, AZ, had the following high temperatures (in °F):
| 71 | 59 | 69 | 68 | 63 | 57 |
| 57 | 57 | 57 | 65 | 67 |
(Weather Underground, n.d.)
Find the descriptive statistics for this data set using the TI-83/84 calculator.
First you need to put the data into the calculator. To do this, press STAT. The STAT button is in the third row of buttons, next to the arrow keys.
Once you press STAT, you will see the following screen:
Choose 1:Edit… and you will see the following:
Note: If there is already data in list 1 (L1), then you should move the cursor up to L1 by using the arrow keys. Then, press clear and enter. This should clear all data from list 1 (L1).
Now type all of the data into list 1 (L1):
Note: Figure \(\PageIndex{14}\) only shows the last six data points entered, but all the data has been entered.
Next, press STAT again and move over to CALC using the right arrow button. You will see the following:
Choose 1:1-Var Stats. This will put 1-Var Stats on your home screen. Type in L1 (2nd 1), and the calculator will show the following:
At this point press ENTER, and you will see the following: (Use the down arrow button to see the rest of the results.)
Therefore, the mean is \(\overline{x} = 62.7^{\circ}F\), the sample standard deviation is \(s = 5.515^{\circ}F\), and the five-number summary is Min = 57°F, Q1 = 57°F, Med = Q2 = 63°F, Q3 = 68°F, Max = 71°F. You can find the range by subtracting the minimum from the maximum, the IQR by subtracting Q1 from Q3, and the variance by squaring the standard deviation. You cannot find the mode from the calculator. Note that the calculator also gives you the population standard deviation \(\sigma = 5.259^{\circ}F\). It differs from the value for \(s\) because the two are calculated differently: \(s\) divides by \(n-1\), while \(\sigma\) divides by \(N\). When the data are a sample, the value the calculator reports for \(\sigma\) is not the true population standard deviation. The calculator gives you both values because it does not know whether you typed in a sample or a population. You can ignore the population standard deviation \(\sigma\) in almost all cases.
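For readers without a TI-83/84, the same descriptive statistics can be sketched with Python's standard `statistics` module. This is an alternative check, not part of the calculator procedure; the variable names are illustrative, and Q1 and Q3 are taken from the five-number summary above rather than recomputed.

```python
import statistics

temps = [71, 59, 69, 68, 63, 57, 57, 57, 57, 65, 67]

s = statistics.stdev(temps)       # sample standard deviation (n - 1)
sigma = statistics.pstdev(temps)  # population standard deviation (N)
variance = s ** 2                 # variance is the standard deviation squared
value_range = max(temps) - min(temps)
iqr = 68 - 57                     # Q3 - Q1 from the five-number summary
mode = statistics.mode(temps)     # most frequent value

print(round(s, 3))         # 5.515
print(round(sigma, 3))     # 5.259
print(round(variance, 2))  # 30.42
print(value_range)         # 14
print(iqr)                 # 11
print(mode)                # 57
```

Unlike the calculator, this sketch also reports the mode, which is 57°F because that temperature occurs four times.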