5.3: Numerical Measures of Center and Variation
It is often desirable to use a few numbers to summarize a distribution. One important aspect of a distribution is where its center is located; measures of central tendency are discussed first. A second aspect is how spread out the distribution is, in other words, how much the data values vary from one another. The second part of this section describes measures of variation.
Measures of Central Tendency
Let's begin by trying to find the most "typical" value of a data set.
Note that we just used the word "typical" although in many cases you might think of using the word "average." We need to be careful with the word "average" as it means different things to different people in different contexts. One of the most common uses of the word "average" is what mathematicians and statisticians call the "arithmetic mean", or just plain old "mean" for short. "Arithmetic mean" sounds rather fancy, but you have likely calculated a mean many times without realizing it; the mean is what most people think of when they use the word "average."
Mean
The mean of a set of data is the sum of the data values divided by the number of values.
Example \(\PageIndex{1}\)
Marci’s exam scores for her last math class were 79, 86, 82, 94. Find the mean.
Solution
The mean of these values would be:
\[\dfrac{79+86+82+94}{4}=85.25 \nonumber \]
Example \(\PageIndex{2}\)
The number of touchdown (TD) passes thrown by each of the 31 teams in the National Football League in the 2000 season is shown below.
37, 33, 33, 32, 29, 28, 28, 23, 22, 22, 22, 21, 21, 21, 20
20, 19, 19, 18, 18, 18, 18, 16, 15, 14, 14, 14, 12, 12, 9, 6
What is the mean number of touchdown passes thrown in the NFL in the 2000 season?
Solution
Adding these values, we get 634 total TDs. Dividing by 31, the number of data values, we get 634/31 = 20.4516. It would be appropriate to round this to 20.5.
It would be correct for us to report that “The mean number of touchdown passes thrown in the NFL in the 2000 season was 20.5 passes,” but it is not uncommon to see the more casual word “average” used in place of “mean.”
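For readers who want to check the arithmetic with software, here is a minimal Python sketch of the same calculation. The variable names are illustrative choices, not part of the example.

```python
# Touchdown passes per team in the 2000 NFL season (values copied from the example)
td_passes = [37, 33, 33, 32, 29, 28, 28, 23, 22, 22, 22, 21, 21, 21, 20,
             20, 19, 19, 18, 18, 18, 18, 16, 15, 14, 14, 14, 12, 12, 9, 6]

total = sum(td_passes)      # add up all the data values
count = len(td_passes)      # number of data values
td_mean = total / count     # mean = sum divided by count

print(total, count)         # 634 31
print(round(td_mean, 1))    # 20.5
```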
Try it Now 4
The prices of a jar of peanut butter at 5 stores were $3.29, $3.59, $3.79, $3.75, and $3.99. Find the mean price.
Example \(\PageIndex{3}\)
One hundred families in a particular neighborhood are asked their annual household income, to the nearest $5 thousand. The results are summarized in the frequency table below.
| Income (thousands of dollars) | Frequency |
|---|---|
| 15 | 6 |
| 20 | 8 |
| 25 | 11 |
| 30 | 17 |
| 35 | 19 |
| 40 | 20 |
| 45 | 12 |
| 50 | 7 |
Calculate the mean.
Solution
Calculating the mean by hand could get tricky if we try to type in all 100 values:
\[\dfrac{\overset{6 \text { terms }}{\overbrace{15+\cdots 15}}\; + \overset{8 \text { terms }}{\overbrace{20+\cdots 20}} + \overset{11 \text { terms }}{\overbrace{25+\cdots 25}} + \cdots}{100} \nonumber \]
We could calculate this more easily by noticing that adding six 15s together is the same as multiplying: \(15 \cdot 6 = 90\). Using this simplification for each income level, we get:
\[\dfrac{15 \cdot 6+20 \cdot 8+25 \cdot 11+30 \cdot 17+35 \cdot 19+40 \cdot 20+45 \cdot 12+50 \cdot 7}{100}=\dfrac{3390}{100}=33.9 \nonumber \]
The mean household income of our sample is 33.9 thousand dollars ($33,900).
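The shortcut of multiplying each value by its frequency can be sketched in Python as follows. The dictionary `income_freq` and the helper `mean_from_frequencies` are names made up for this illustration.

```python
# Frequency table from the example: income (thousands of dollars) -> number of families
income_freq = {15: 6, 20: 8, 25: 11, 30: 17, 35: 19, 40: 20, 45: 12, 50: 7}

def mean_from_frequencies(freq):
    """Mean of a data set given as {value: frequency}."""
    weighted_sum = sum(value * count for value, count in freq.items())
    n = sum(freq.values())          # total number of data values
    return weighted_sum / n

print(sum(income_freq.values()))            # 100
print(mean_from_frequencies(income_freq))   # 33.9
```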
Example \(\PageIndex{4}\)
Extending the last example, suppose a new family with a household income of $5 million ($5000 thousand) moves into the neighborhood. Find the new mean household income.
Solution
Adding this to our sample, our mean is now:
\[\dfrac{15 \cdot 6+20 \cdot 8+25 \cdot 11+30 \cdot 17+35 \cdot 19+40 \cdot 20+45 \cdot 12+50 \cdot 7+5000 \cdot 1}{101}=\frac{8390}{101}=83.069 \nonumber \]
While 83.1 thousand dollars ($83,069) is the correct mean household income, it no longer represents a “typical” value.
An extreme value, much bigger or much smaller than most other values in a given dataset, is called an “outlier”. An outlier can affect the mean.
Imagine the data values on a see-saw or balance scale. The mean is the value that keeps the data in balance, like in the picture below.
If we graph our household data, the $5 million data value is so far out to the right that the mean has to adjust up to keep things in balance.
In situations like this, where one value is much bigger or smaller than most other values in the data, there is a better measure of center than the mean.
Median
The median of a set of data is the value in the middle when the data is in order.
To find the median, begin by listing the data in order from smallest to largest, or largest to smallest.
If the number of data values, \(N\), is odd, then the median is the middle data value. This can be found by rounding \(N/2\) up to the next whole number.
If the number of data values is even, there is no one middle value, so we find the mean of the two middle values (values \(N/2\) and \(N/2 + 1\) ).
Example \(\PageIndex{5}\)
Returning to the football touchdown data, we would start by listing the data in order. Luckily, it was already in decreasing order, so we can work with it without needing to reorder it first.
37, 33, 33, 32, 29, 28, 28, 23, 22, 22, 22, 21, 21, 21, 20
20, 19, 19, 18, 18, 18, 18, 16, 15, 14, 14, 14, 12, 12, 9, 6
Find the median number of touchdown passes.
Solution
Since there are 31 data values, an odd number, the median will be the middle number, the 16th data value (31/2 = 15.5, round up to 16, leaving 15 values below and 15 above). The 16th data value is 20, so the median number of touchdown passes in the 2000 season was 20 passes. Notice that for this data, the median is fairly close to the mean we calculated earlier, 20.5.
Example \(\PageIndex{6}\)
Find the median of these quiz scores: 5, 10, 8, 6, 4, 8, 2, 5, 7, 7
Solution
We start by listing the data in order: 2, 4, 5, 5, 6, 7, 7, 8, 8, 10
Since there are 10 data values, an even number, there is no one middle number. So we find the mean of the two middle numbers, 6 and 7, and get (6+7)/2 = 6.5.
The median quiz score was 6.5.
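The odd and even cases of the median rule can be written as one short function. This is a sketch for illustration; the function name `median_of` is an assumption, and Python's standard library also offers `statistics.median`.

```python
def median_of(values):
    """Median: the middle value of the sorted data, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: position N/2 rounded up (1-based) is index n // 2 (0-based)
        return ordered[mid]
    # Even count: mean of positions N/2 and N/2 + 1 (1-based)
    return (ordered[mid - 1] + ordered[mid]) / 2

td_passes = [37, 33, 33, 32, 29, 28, 28, 23, 22, 22, 22, 21, 21, 21, 20,
             20, 19, 19, 18, 18, 18, 18, 16, 15, 14, 14, 14, 12, 12, 9, 6]
quiz_scores = [5, 10, 8, 6, 4, 8, 2, 5, 7, 7]

print(median_of(td_passes))     # 20
print(median_of(quiz_scores))   # 6.5
```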
Try it Now 5
The prices of a jar of peanut butter at 5 stores were $3.29, $3.59, $3.79, $3.75, and $3.99. Find the median price.
Example \(\PageIndex{7}\)
Let us return now to our original household income data.
| Income (thousands of dollars) | Frequency |
|---|---|
| 15 | 6 |
| 20 | 8 |
| 25 | 11 |
| 30 | 17 |
| 35 | 19 |
| 40 | 20 |
| 45 | 12 |
| 50 | 7 |
Find the median income.
Solution
Here we have 100 data values. If we didn’t already know that, we could find it by adding the frequencies. Since 100 is an even number, we need to find the mean of the middle two data values - the 50th and 51st data values. To find these, we start counting up from the bottom:
There are 6 data values of $15, so Values 1 to 6 are $15 thousand
The next 8 data values are $20, so Values 7 to (6+8)=14 are $20 thousand
The next 11 data values are $25, so Values 15 to (14+11)=25 are $25 thousand
The next 17 data values are $30, so Values 26 to (25+17)=42 are $30 thousand
The next 19 data values are $35, so Values 43 to (42+19)=61 are $35 thousand
From this we can tell that values 50 and 51 will be $35 thousand, and the mean of these two values is $35 thousand. The median income in this neighborhood is $35 thousand.
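The counting procedure above amounts to keeping a running total of the frequencies until it reaches the position we want. A minimal Python sketch, with the helper names `value_at` and `median_from_frequencies` invented for this illustration:

```python
# Frequency table from the example: income (thousands of dollars) -> number of families
income_freq = {15: 6, 20: 8, 25: 11, 30: 17, 35: 19, 40: 20, 45: 12, 50: 7}

def value_at(freq, position):
    """Data value at a 1-based position when the data is sorted from smallest to largest."""
    running_total = 0
    for value in sorted(freq):
        running_total += freq[value]    # last position occupied by this value
        if position <= running_total:
            return value
    raise ValueError("position exceeds the number of data values")

def median_from_frequencies(freq):
    n = sum(freq.values())
    if n % 2 == 1:
        return value_at(freq, (n + 1) // 2)
    return (value_at(freq, n // 2) + value_at(freq, n // 2 + 1)) / 2

print(value_at(income_freq, 50), value_at(income_freq, 51))   # 35 35
print(median_from_frequencies(income_freq))                   # 35.0
```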
Example \(\PageIndex{8}\)
If we add in the new neighbor with a $5 million household income, how does the median change?
Solution
If we add in the new neighbor with a $5 million household income, then there will be 101 data values, and the 51st value will be the median. As we discovered in the last example, the 51st value is $35 thousand. Notice that the new neighbor did not affect the median in this case. The median is not swayed as much by outliers as the mean is.
This is the main reason the median, not the mean, is often used to summarize data such as household incomes or home prices.
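To see the contrast between the two measures side by side, the following Python sketch recomputes both with and without the $5 million household. It uses the standard library's `statistics.mean` and `statistics.median`; the variable names are illustrative.

```python
from statistics import mean, median

# Frequency table from the example: income (thousands of dollars) -> number of families
income_freq = {15: 6, 20: 8, 25: 11, 30: 17, 35: 19, 40: 20, 45: 12, 50: 7}

# Expand the table into the 100 individual data values
incomes = [value for value, count in income_freq.items() for _ in range(count)]
with_outlier = incomes + [5000]     # add the $5 million ($5000 thousand) household

print(round(mean(incomes), 1), median(incomes))             # 33.9 35.0
print(round(mean(with_outlier), 1), median(with_outlier))   # 83.1 35
```

The mean more than doubles, while the median stays at $35 thousand.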
In addition to the mean and the median, there is one other common measurement of the "typical" value of a data set: the mode.
Mode
The mode is the element of the data set that occurs most frequently.
The mode is fairly useless with data like weights or heights where there are a large number of possible values. The mode is most commonly used for categorical data, for which median and mean cannot be computed.
Example \(\PageIndex{9}\)
In our vehicle color survey, we collected the data
| Color | Frequency |
|---|---|
| Blue | 3 |
| Green | 5 |
| Red | 4 |
| White | 3 |
| Black | 2 |
| Grey | 3 |
What is the mode?
Solution
For this data, Green is the mode, since it is the data value that occurred the most frequently.
It is possible for a data set to have more than one mode (if several categories are tied for the highest frequency), or no mode (if every category occurs only once).
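Counting frequencies and picking the most common value is easy to sketch in Python. The function `modes` below is a made-up helper that follows this section's convention: it returns every value tied for the highest frequency, and an empty list when each value occurs only once.

```python
from collections import Counter

def modes(values):
    """Return all values tied for the highest frequency (empty if nothing repeats)."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []       # every value occurs only once: no mode, per the text above
    return [value for value, count in counts.items() if count == highest]

# Vehicle color survey from the example, expanded into individual responses
colors = (["Blue"] * 3 + ["Green"] * 5 + ["Red"] * 4 +
          ["White"] * 3 + ["Black"] * 2 + ["Grey"] * 3)

print(modes(colors))            # ['Green']
print(modes([1, 1, 2, 2, 3]))   # [1, 2]
print(modes([1, 2, 3]))         # []
```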
Try it Now 6
Reviewers were asked to rate a product on a scale of 1 to 5. Find
- The mean rating
- The median rating
- The mode rating
| Rating | Frequency |
|---|---|
| 1 | 4 |
| 2 | 8 |
| 3 | 7 |
| 4 | 3 |
| 5 | 1 |
Measures of Variation
Consider these three sets of quiz scores:
Section A: 5, 5, 5, 5, 5, 5, 5, 5, 5, 5
Section B: 0, 0, 0, 0, 0, 10, 10, 10, 10, 10
Section C: 4, 4, 4, 5, 5, 5, 5, 6, 6, 6
All three of these sets of data have a mean of 5 and median of 5, yet the sets of scores are clearly quite different. In section A, everyone had the same score; in section B half the class got no points and the other half got a perfect score, assuming this was a 10-point quiz. Section C was not as consistent as section A, but not as widely varied as section B.
In addition to the mean and median, which are measures of the "typical" or "middle" value, we also need a measure of how "spread out" or varied each data set is.
There are several ways to measure this "spread" of the data. The first is the simplest and is called the range.
Range
The range is the difference between the maximum value and the minimum value of the data.
Example \(\PageIndex{10}\)
Using the quiz scores from above, find the range for each section.
Solution
For section A, the range is 0 since both maximum and minimum are 5 and 5 – 5 = 0.
For section B, the range is 10 since 10 – 0 = 10.
For section C, the range is 2 since 6 – 4 = 2.
In the last example, the range seems to be revealing how spread out the data is.
However, suppose we add a fourth section, Section D, with scores 0, 5, 5, 5, 5, 5, 5, 5, 5, 10.
This section also has a mean and median of 5. The range is 10, yet this data set is quite different from Section B. To better illuminate the differences, we’ll have to turn to more sophisticated measures of variation.
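A short Python sketch makes the limitation concrete: Sections B and D get the same range even though their scores are distributed very differently. The names `sections` and `data_range` are illustrative.

```python
sections = {
    "A": [5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
    "B": [0, 0, 0, 0, 0, 10, 10, 10, 10, 10],
    "C": [4, 4, 4, 5, 5, 5, 5, 6, 6, 6],
    "D": [0, 5, 5, 5, 5, 5, 5, 5, 5, 10],
}

def data_range(values):
    """Range: maximum value minus minimum value."""
    return max(values) - min(values)

ranges = {name: data_range(scores) for name, scores in sections.items()}
print(ranges)   # {'A': 0, 'B': 10, 'C': 2, 'D': 10}
```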
Standard Deviation
The standard deviation is a measure of variation based on measuring how far each data value deviates, or is different, from the mean. A few important characteristics:
- Standard deviation is never negative. It is zero only if all the data values are equal, and it gets larger as the data spreads out.
- Standard deviation has the same unit as the original data.
- Standard deviation, like the mean, can be highly influenced by outliers.
Computing the standard deviation takes several steps, so let's work through them one at a time.
Using the data from section D (Example \(\PageIndex{10}\)), we could compute for each data value the difference between the data value and the mean:
| Data value | Deviation: Data value - Mean |
|---|---|
| 0 | 0-5 = -5 |
| 5 | 5-5 = 0 |
| 5 | 5-5 = 0 |
| 5 | 5-5 = 0 |
| 5 | 5-5 = 0 |
| 5 | 5-5 = 0 |
| 5 | 5-5 = 0 |
| 5 | 5-5 = 0 |
| 5 | 5-5 = 0 |
| 10 | 10-5 = 5 |
We would like to get an idea of the "average" deviation from the mean, but if we find the average of the values in the second column, the negative and positive values cancel each other out, resulting in an average deviation of 0 (this will always happen). To prevent this, we square every value in the second column:
| Data value | Deviation: Data value - Mean | Deviation squared |
|---|---|---|
| 0 | 0-5 = -5 | \((-5)^2 = 25\) |
| 5 | 5-5 = 0 | \(0^2 = 0\) |
| 5 | 5-5 = 0 | \(0^2 = 0\) |
| 5 | 5-5 = 0 | \(0^2 = 0\) |
| 5 | 5-5 = 0 | \(0^2 = 0\) |
| 5 | 5-5 = 0 | \(0^2 = 0\) |
| 5 | 5-5 = 0 | \(0^2 = 0\) |
| 5 | 5-5 = 0 | \(0^2 = 0\) |
| 5 | 5-5 = 0 | \(0^2 = 0\) |
| 10 | 10-5 = 5 | \((5)^2 = 25\) |
We then add the squared deviations up to get 25 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 25 = 50. Ordinarily we would then divide by the number of scores, n (in this case, 10), to find the mean of the squares of the deviations. But we only do this if the data set represents a population; if the data set represents a sample (as it almost always does), we instead divide by \(n - 1\) (in this case, 10-1=9).
The quotient (by \(n\) or by \(n - 1\)) is called the “variance” of the data set.
So in our example, we would have 50/10 = 5 if section D represents a population and 50/9 = about 5.56 if section D represents a sample. These values (5 and 5.56) are called, respectively, the population variance and the sample variance for section D.
We are almost there—one more step, and we will find the standard deviation.
Variance can be a useful statistical concept, but note that the unit of variance in this instance would be points-squared since we squared all of the deviations. What are points-squared? Good question. We would rather deal with the units we started with (points in this case), so to convert back we take the square root and get:
Population standard deviation=\(\sqrt{\dfrac{50}{10}}=\sqrt{5} \approx 2.2\)
or
sample standard deviation=\(\sqrt{\dfrac{50}{9}} \approx 2.4\)
If we are unsure whether the data set is a sample or a population, we will usually assume it is a sample, and we will round answers to one more decimal place than the original data as we have done above.
As you can see, it takes multiple steps to calculate the standard deviation of a dataset. For most real-life applications, the calculations are performed using technology. Here is a brief summary.
To compute Standard Deviation
- Find the deviation of each data value from the mean. In other words, subtract the mean from the data value.
- Square each deviation.
- Add the squared deviations.
- Divide by \(n\), the number of data values, if the data represents a whole population; divide by \(n - 1\) if the data is from a sample.
- Compute the square root of the result.
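The five steps above translate almost line for line into Python. This is a sketch for illustration only; the function name `standard_deviation` and its `sample` flag are assumptions, and the standard library provides `statistics.pstdev` and `statistics.stdev` for the population and sample versions.

```python
import math

def standard_deviation(values, sample=True):
    """Standard deviation following the five steps listed above."""
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]       # step 1: deviation from the mean
    squared = [d ** 2 for d in deviations]        # step 2: square each deviation
    total = sum(squared)                          # step 3: add the squared deviations
    variance = total / (n - 1) if sample else total / n   # step 4: divide by n - 1 or n
    return math.sqrt(variance)                    # step 5: take the square root

section_d = [0, 5, 5, 5, 5, 5, 5, 5, 5, 10]
print(round(standard_deviation(section_d, sample=False), 1))   # 2.2
print(round(standard_deviation(section_d, sample=True), 1))    # 2.4
```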
Example \(\PageIndex{11}\)
Compute the standard deviation for Section B above.
Solution
Computing the standard deviation for Section B above, we first calculate that the mean is 5. Using a table can help keep track of your computations for the standard deviation:
| Data value | Deviation: Data value - Mean | Deviation squared |
|---|---|---|
| 0 | 0-5 = -5 | \((-5)^2 = 25\) |
| 0 | 0-5 = -5 | \((-5)^2 = 25\) |
| 0 | 0-5 = -5 | \((-5)^2 = 25\) |
| 0 | 0-5 = -5 | \((-5)^2 = 25\) |
| 0 | 0-5 = -5 | \((-5)^2 = 25\) |
| 10 | 10-5 = 5 | \((5)^2 = 25\) |
| 10 | 10-5 = 5 | \((5)^2 = 25\) |
| 10 | 10-5 = 5 | \((5)^2 = 25\) |
| 10 | 10-5 = 5 | \((5)^2 = 25\) |
| 10 | 10-5 = 5 | \((5)^2 = 25\) |
Assuming this data represents a population, we will add the squared deviations, divide by 10, the number of data values, and compute the square root:
\[\sqrt{\dfrac{25+25+25+25+25+25+25+25+25+25}{10}}=\sqrt{\dfrac{250}{10}}=5 \nonumber \]
Notice that the standard deviation of this data set is much larger than that of section D since the data in this set is more spread out.
For comparison, the population standard deviations of all four sections are as follows:
| Section | Standard deviation |
|---|---|
| Section A: 5, 5, 5, 5, 5, 5, 5, 5, 5, 5 | 0 |
| Section B: 0, 0, 0, 0, 0, 10, 10, 10, 10, 10 | 5 |
| Section C: 4, 4, 4, 5, 5, 5, 5, 6, 6, 6 | 0.8 |
| Section D: 0, 5, 5, 5, 5, 5, 5, 5, 5, 10 | 2.2 |
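In practice these values are usually obtained with technology. As one possible approach, Python's standard library has `statistics.pstdev` (divide by \(n\)) and `statistics.stdev` (divide by \(n - 1\)); the sketch below computes both for each section, rounded to one decimal place. The dictionary names are illustrative.

```python
from statistics import pstdev, stdev

sections = {
    "A": [5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
    "B": [0, 0, 0, 0, 0, 10, 10, 10, 10, 10],
    "C": [4, 4, 4, 5, 5, 5, 5, 6, 6, 6],
    "D": [0, 5, 5, 5, 5, 5, 5, 5, 5, 10],
}

# Population standard deviation (treats each section as the whole population)
population_sd = {name: round(pstdev(scores), 1) for name, scores in sections.items()}
# Sample standard deviation (treats each section as a sample)
sample_sd = {name: round(stdev(scores), 1) for name, scores in sections.items()}

print(population_sd)   # {'A': 0.0, 'B': 5.0, 'C': 0.8, 'D': 2.2}
print(sample_sd)       # {'A': 0.0, 'B': 5.3, 'C': 0.8, 'D': 2.4}
```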
Try it Now 7
The prices of a jar of peanut butter at 5 stores were: $3.29, $3.59, $3.79, $3.75, and $3.99. Find the standard deviation of the prices.
Where standard deviation is a measure of variation based on the mean, quartiles are based on the median.
Quartiles
Quartiles are values that divide the data in quarters.
The first quartile (Q1) is the value so that 25% of the data values are below it; the third quartile (Q3) is the value so that 75% of the data values are below it. You may have guessed that the second quartile is the same as the median, since the median is the value so that 50% of the data values are below it.
This divides the data into quarters: 25% of the data is between the minimum and Q1, 25% is between Q1 and the median, 25% is between the median and Q3, and 25% is between Q3 and the maximum value.
While quartiles are not a 1-number summary of variation like standard deviation, the quartiles are used with the median, minimum, and maximum values to form a 5-number summary of the data.
Five-number Summary
The five-number summary takes this form:
Minimum, Q1, Median, Q3, Maximum.
To find the first quartile, we need to find the data value so that 25% of the data is below it.
But note that this is the same as the “median” of the first (lower) half of the dataset. So an easy way to find the first quartile is to split the original data set in half and find the median of the lower half. The third quartile is similar; find the median of the second (higher) half of the data set.
Warning: The precise definitions and method of finding quartiles vary slightly from book to book. You may see slightly different explanations in other resources.
Example \(\PageIndex{12}\)
Suppose we have measured the heights (in inches) of 9 females. Sorted from smallest to largest, they are:
59, 60, 62, 64, 66, 67, 69, 70, 72
Find the quartiles.
Solution
Note the median is 66.
The lower half, {59, 60, 62, 64}, has a median of 61, and the higher half, {67, 69, 70, 72}, has a median of 69.5.
So Q1 is 61, Q2 is 66, and Q3 is 69.5.
Example \(\PageIndex{13}\)
Suppose we had measured the heights (in inches) of 8 females. Sorted from smallest to largest, they are:
59, 60, 62, 64, 66, 67, 69, 70
Find the quartiles.
Solution
Note the median in this case is 65, the mean of 64 and 66.
Then the lower half is the same as in Example \(\PageIndex{12}\) (its median is 61). The higher half, {66, 67, 69, 70}, now has a median of 68.
So, Q1 is 61, Q2 is 65, and Q3 is 68.
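The "median of each half" method used in these two examples can be sketched in Python as shown below. The function name `quartiles` is an assumption made for this illustration. As the warning above notes, other resources define quartiles differently, so library routines such as `statistics.quantiles` may return slightly different values.

```python
from statistics import median

def quartiles(values):
    """Q1, Q2, Q3 using the median of the lower and upper halves.

    When the number of values is odd, the middle value is left out of both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)

nine_heights = [59, 60, 62, 64, 66, 67, 69, 70, 72]
eight_heights = [59, 60, 62, 64, 66, 67, 69, 70]

print(quartiles(nine_heights))    # (61.0, 66, 69.5)
print(quartiles(eight_heights))   # (61.0, 65.0, 68.0)
```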
The 5-number summary combines the first and third quartile with the minimum, median, and maximum values.
Example \(\PageIndex{14}\)
For the 9-female sample, the median is 66, the minimum is 59, and the maximum is 72. The 5-number summary is: 59, 61, 66, 69.5, 72.
For the 8-female sample, the median is 65, the minimum is 59, and the maximum is 70, so the 5-number summary would be: 59, 61, 65, 68, 70.
Example \(\PageIndex{15}\)
Let's return to our quiz score data. In each case, there are 10 values, so the median is the mean of the 5th and 6th values. The first quartile will be the 3rd data value, and the third quartile will be the 8th data value. Creating the five-number summaries:
| Section and data | 5-number summary |
|---|---|
| Section A: 5, 5, 5, 5, 5, 5, 5, 5, 5, 5 | 5, 5, 5, 5, 5 |
| Section B: 0, 0, 0, 0, 0, 10, 10, 10, 10, 10 | 0, 0, 5, 10, 10 |
| Section C: 4, 4, 4, 5, 5, 5, 5, 6, 6, 6 | 4, 4, 5, 6, 6 |
| Section D: 0, 5, 5, 5, 5, 5, 5, 5, 5, 10 | 0, 5, 5, 5, 10 |
Of course, with a relatively small data set, finding a five-number summary is a bit silly since the summary contains almost as many values as the original data.
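For larger data sets the summary is normally computed with software. A minimal Python sketch, assuming the same "median of each half" quartile method used in this section (the function name `five_number_summary` is illustrative):

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # With an odd count, the middle value belongs to neither half
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

sections = {
    "A": [5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
    "B": [0, 0, 0, 0, 0, 10, 10, 10, 10, 10],
    "C": [4, 4, 4, 5, 5, 5, 5, 6, 6, 6],
    "D": [0, 5, 5, 5, 5, 5, 5, 5, 5, 10],
}

summaries = {name: five_number_summary(scores) for name, scores in sections.items()}
for name, summary in summaries.items():
    print("Section", name, summary)
```

The four summaries agree with the table in the example above.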
Try it Now 8
The total cost of textbooks for the term was collected from 36 students. Find the 5-number summary of this data.
$140 $160 $160 $165 $180 $220 $235 $240 $250 $260 $280 $285
$285 $285 $290 $300 $300 $305 $310 $310 $315 $315 $320 $320
$330 $340 $345 $350 $355 $360 $360 $380 $395 $420 $460 $460
Example \(\PageIndex{16}\)
Returning to the household income data from earlier, create the 5-number summary.
| Income (thousands of dollars) | Frequency |
|---|---|
| 15 | 6 |
| 20 | 8 |
| 25 | 11 |
| 30 | 17 |
| 35 | 19 |
| 40 | 20 |
| 45 | 12 |
| 50 | 7 |
Solution
By adding the frequencies, we can see there are 100 data values represented in the table. In Example \(\PageIndex{7}\), we found the median was $35 thousand. We can see in the table that the minimum income is $15 thousand, and the maximum is $50 thousand.
Counting up in the data as we did before,
There are 6 data values of $15, so Values 1 to 6 are $15 thousand
The next 8 data values are $20, so Values 7 to (6+8)=14 are $20 thousand
The next 11 data values are $25, so Values 15 to (14+11)=25 are $25 thousand
The next 17 data values are $30, so Values 26 to (25+17)=42 are $30 thousand
The lower half has 50 values, so Q1 is the mean of the 25th and the 26th values.
The 25th data value is $25 thousand, and the 26th data value is $30 thousand, so Q1 will be the mean of these: (25 + 30)/2 = $27.5 thousand.
Similarly, Q3 will be the mean of the 75th and 76th data values. Continuing our counting from earlier,
The next 19 data values are $35, so Values 43 to (42+19)=61 are $35 thousand
The next 20 data values are $40, so Values 62 to (61+20)=81 are $40 thousand
Both the 75th and 76th data values lie in this group, so Q3 will be $40 thousand.
Putting these values together into a five-number summary, we get: 15, 27.5, 35, 40, 50.
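One way to double-check this summary is to expand the frequency table into its 100 individual values and apply the same halving method. The Python sketch below does that; the variable names are illustrative, and the code assumes an even number of data values, as in this example.

```python
from statistics import median

# Frequency table from the example: income (thousands of dollars) -> number of families
income_freq = {15: 6, 20: 8, 25: 11, 30: 17, 35: 19, 40: 20, 45: 12, 50: 7}

# Expand into a sorted list of the 100 individual data values
incomes = sorted(value for value, count in income_freq.items() for _ in range(count))

half = len(incomes) // 2            # 50 values in each half (even count assumed)
q1 = median(incomes[:half])         # mean of the 25th and 26th values
q2 = median(incomes)                # mean of the 50th and 51st values
q3 = median(incomes[half:])         # mean of the 75th and 76th values

summary = (incomes[0], q1, q2, q3, incomes[-1])
print(summary)   # (15, 27.5, 35.0, 40.0, 50)
```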
Note that the 5-number summary divides the data into four intervals, each of which will contain about 25% of the data. In the previous example, that means about 25% of households have income between $40 thousand and $50 thousand.
For visualizing data, there is a graphical representation of a 5-number summary called a box plot, or box and whisker graph.
Box Plot
A box plot is a graphical representation of a 5-number summary.
To create a box plot, a number line is first drawn. A box is drawn from the first quartile to the third quartile, and a line is drawn through the box at the median. “Whiskers” are extended out to the minimum and maximum values.
Example \(\PageIndex{17}\)
Draw a box plot below based on the 5-number summary:
59, 62, 66, 69, 72.
Solution
Example \(\PageIndex{18}\)
Draw a box plot below based on the household income data with the 5-number summary:
15, 27.5, 35, 40, 50
Solution
Try it Now 9
Create a boxplot based on the textbook price data from the last Try it Now.
Box plots are particularly useful for comparing data from two populations.
Example \(\PageIndex{19}\)
The box plot of service times for two fast-food restaurants is shown below.
While store 2 had a slightly shorter median service time (2.1 minutes vs. 2.3 minutes), store 2 is less consistent, with a wider spread of the data.
At store 1, 75% of customers were served within 2.9 minutes, while at store 2, 75% of customers were served within 5.7 minutes.
Which store should you go to in a hurry?
Solution
That depends upon your opinions about luck – 25% of customers at store 2 had to wait between 5.7 and 9.6 minutes.
Example \(\PageIndex{20}\)
The boxplot below is based on the birth weights of infants with severe idiopathic respiratory distress syndrome (SIRDS). The boxplot is separated to show the birth weights of infants who survived and those that did not.
Compare the two groups to determine whether birth weight appears to be linked to survival of infants with SIRDS.
Solution
Comparing the two groups, the boxplot reveals that the birth weights of the infants that died appear to be, overall, smaller than the weights of infants that survived. In fact, we can see that the median birth weight of infants that survived is the same as the third quartile of the infants that died.
Similarly, we can see that the first quartile of the survivors is larger than the median weight of those that died, meaning that over 75% of the survivors had a birth weight larger than the median birth weight of those that died.
Looking at the maximum value for those that died and the third quartile of the survivors, we can see that over 25% of the survivors had birth weights higher than the heaviest infant that died.
The box plot gives us a quick, albeit informal, way to determine that birth weight is quite likely linked to survival of infants with SIRDS.
Contributors and Attributions
- Saburo Matsumoto, CC-BY-4.0