3.1: Measures of Center
Introduction
In the previous section, we learned how to organize and present the collected data visually. While a picture is worth a thousand words, describing a picture may not always be an option. In this section, we will learn how to describe the data and its distribution numerically.
Consider the following example: a zoologist was sent to a distant island to study the population of lemurs. Let’s say they are interested in the ages of the lemurs. Of course, the goal is to produce a histogram, and that histogram will represent what we know about the lemurs’ ages! After collecting the data, the zoologist wants to communicate their findings, but the internet is down, so the only way to send the information is by telegraph. For those of you who are not familiar with the telegraph: it was the Twitter of its day. The solution is to create a numerical summary that can be sent by telegram and then interpreted on the other end to recover the visual summary.
In other words, when studying data, the goal is always to have the visual summary that tells the story. When it is not possible to get the visual summary directly, we compute the numerical summary of data that later can be interpreted to describe the shape of the distribution.
There are three types of numerical measures that we will cover in this course:
- the measures of center or central tendency – describe what is considered the most typical observation in the data set;
- the measures of variation or spread – describe how much data varies from the center;
- the measures of position or relative standing – describe the position of each individual observation among the entire data set.
Next, we will discuss different ways to produce the numerical summary of data.
Section 1: Measures of Center
In this section, we will learn how to measure the center of the data, or in other words how to define what is a typical observation in the data set.
Descriptive measures that indicate where the center or the most typical value of a data set lies are called measures of central tendency or, more simply, measures of center. The three most common measures of center are the mean, the median, and the mode.
Consider the following set of a student’s 5 midterm exam scores:
86, 79, 96, 79, 100
The mean of a data set is the sum of the observations divided by the number of observations.
For example, the mean of {86, 79, 96, 79, 100} is the sum of all five numbers divided by 5, which is \(\frac{86+79+96+79+100}{5}=\frac{440}{5}=88\).
Let’s drop the lowest exam score so that now we have only one 79:
86, 96, 79, 100
The mean of the four remaining numbers is the sum of all four of them divided by 4, which is \(\frac{86+96+79+100}{4}=\frac{361}{4}=90.25\).
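The two means above can be checked with a short sketch. It assumes Python's standard `statistics` module; the variable names are illustrative and not part of the text.

```python
from statistics import mean

# The five midterm scores from the example.
scores = [86, 79, 96, 79, 100]
mean_all = mean(scores)  # 440 / 5

# Drop one copy of the lowest score (a 79) and recompute.
remaining = sorted(scores)[1:]
mean_after_drop = mean(remaining)  # 361 / 4

print(mean_all, mean_after_drop)
```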
The median of a data set is the middle number that divides the bottom 50% of the data from the top 50%.
For example, the median of {79, 79, 86, 96, 100} is 86 because, once the data is arranged in increasing order, there are two numbers below 86 and two numbers above it.
Let’s drop the lowest exam score so that now we have only one 79, leaving {79, 86, 96, 100}. The median is still the middle of the ordered data, but since there is an even number of values, no single observation sits in the middle. Instead, we take the average of the two middle values, 86 and 96. In this example, the median of the data is \(\frac{86+96}{2}=91\).
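A minimal sketch of the odd and even cases, assuming Python's standard `statistics.median`, which sorts the data and averages the two middle values when the count is even:

```python
from statistics import median

scores = [86, 79, 96, 79, 100]

# Odd count: the middle value of 79, 79, 86, 96, 100.
median_all = median(scores)

# Even count after dropping one 79: the average of 86 and 96.
remaining = sorted(scores)[1:]
median_after_drop = median(remaining)

print(median_all, median_after_drop)
```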
The mode of a data set is the most frequently occurring observation in the data set.
For example, the mode of {86, 79, 96, 79, 100} is 79 because 79 occurs twice, which is more than any other observation.
Let’s drop the lowest exam score so that now we have only one 79. The mode is still defined as the most frequent observation, but now every observation occurs exactly once, so no value is more frequent than the others. In that case we say the data has no mode.
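The convention above (several modes are allowed, and there is no mode when every value is equally frequent) can be sketched as follows. The function name `modes` is hypothetical. Note that the standard library's `statistics.multimode` follows a different convention and would return every value when all frequencies tie.

```python
from collections import Counter

def modes(data):
    """Return the most frequent values, or [] when all values tie.

    Follows the text's convention: if every distinct value occurs the
    same number of times, the data set has no mode.
    """
    counts = Counter(data)
    if not counts:
        return []
    top = max(counts.values())
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []  # every value is equally frequent: no mode
    return sorted(value for value, c in counts.items() if c == top)

print(modes([86, 79, 96, 79, 100]))  # [79]
print(modes([86, 96, 79, 100]))      # []
```

Because the function only counts occurrences, it also works for qualitative data such as a list of color names.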
It is worth mentioning that the mean and median apply only to quantitative data, whereas the mode can be used with either quantitative or qualitative data. Thus, while the mode appears to be the least useful among the three measures of central tendency, it is the only way to assess the center for qualitative data! But when working with quantitative data we still have to choose between the mean and the median.
Let’s consider the following example. Four people, Ann, Beth, Connor, and Diana, earn the following annual salaries:
$30k, $50k, $50k, $70k
Their mean salary is $50,000 and their median salary is also $50,000. Now, let’s say Elon moves in and his salary is $300,000 a year. What effect does this have on the mean and median? The new mean is $100,000 and the new median is still $50,000. Which of the two measures produces a more representative value? In this case, the median is clearly the better choice. For example, if Toyota were deciding whether to build a Lexus or a Scion dealership and considered only the mean, it would appear that a typical person in this neighborhood can afford a Lexus, when in reality only one person can. So, what exactly is the difference between the mean and the median?
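The effect of the one extreme salary can be reproduced with a short sketch, again assuming the standard `statistics` module. Salaries are written in dollars and the variable names are illustrative.

```python
from statistics import mean, median

salaries = [30_000, 50_000, 50_000, 70_000]
mean_before = mean(salaries)
median_before = median(salaries)

# Add the one extreme salary and recompute both measures.
with_outlier = salaries + [300_000]
mean_after = mean(with_outlier)
median_after = median(with_outlier)

print(mean_before, median_before)  # both 50,000
print(mean_after, median_after)    # mean doubles, median is unchanged
```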
To understand the differences between the mean and median we introduce a new term.
A resistant measure is not sensitive to the influence of a few extreme observations.
The median is a resistant measure of center, but the mean is not. Usually, we use the median for variables such as house prices and salaries, where a few extreme values are common, and the mean for everything else. One way to make the mean more resistant is to remove a percentage of the smallest and largest observations before computing the mean. Such a measure is called a trimmed (truncated) mean. A professor dropping the lowest midterm exam score applies the same idea, although only the low end is trimmed.
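A minimal sketch of a trimmed mean, assuming the trimming is symmetric (the same share is removed from each end) and that the number of dropped values is rounded down. The function name `trimmed_mean` and its `percent` parameter are hypothetical, and other rounding conventions exist.

```python
from statistics import mean

def trimmed_mean(data, percent):
    """Mean after dropping `percent` % of the values from EACH end.

    Assumption: the number of values dropped per end is rounded down.
    """
    if not 0 <= percent < 50:
        raise ValueError("percent must be in [0, 50)")
    ordered = sorted(data)
    k = len(ordered) * percent // 100  # values dropped from each end
    return mean(ordered[k:len(ordered) - k])

salaries_k = [30, 50, 50, 70, 300]  # in thousands of dollars
plain = trimmed_mean(salaries_k, 0)     # ordinary mean: 100
trimmed = trimmed_mean(salaries_k, 20)  # drops 30 and 300, averages 50, 50, 70
print(plain, trimmed)
```

With 20% trimmed from each end, the extreme salary no longer dominates, and the result moves much closer to the median of 50.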
Consider a larger data set and compute its mode, mean, and median.
| 42 | 43 | 46 | 47 | 47 | 47 | 48 | 49 | 49 |
| 50 | 51 | 51 | 51 | 51 | 51 | 52 | 52 | 54 |
| 54 | 54 | 54 | 54 | 55 | 55 | 55 | 55 | 56 |
| 56 | 56 | 57 | 57 | 57 | 57 | 58 | 60 | 61 |
| 61 | 61 | 62 | 64 | 64 | 65 | 68 | 69 | 70 |
Solution
The modes are 51 and 54 because each occurs 5 times, which is the highest frequency.
The mean is 55.02, the sum of all 45 values divided by 45: \(\frac{2476}{45}\approx 55.02\).
The median is 55 because it is the 23rd of the 45 ordered values, so there are 22 values below it and 22 values above it.
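The solution can be verified with a short sketch, assuming Python 3.8 or later for `statistics.multimode`. The list `data` is the 45 values from the table, typed in row by row.

```python
from statistics import mean, median, multimode

data = [
    42, 43, 46, 47, 47, 47, 48, 49, 49,
    50, 51, 51, 51, 51, 51, 52, 52, 54,
    54, 54, 54, 54, 55, 55, 55, 55, 56,
    56, 56, 57, 57, 57, 57, 58, 60, 61,
    61, 61, 62, 64, 64, 65, 68, 69, 70,
]

n = len(data)                     # 45 observations
data_modes = multimode(data)      # every value with the highest frequency
data_mean = round(mean(data), 2)  # 2476 / 45, rounded to two decimals
data_median = median(data)        # the 23rd ordered value

print(n, data_modes, data_mean, data_median)
```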
We discussed the three most common ways to compute the center of the data set. It should now be clear that the mean, median, and mode generally provide different information. There is no simple rule for deciding which measure of center to use in a given situation. Even experts may disagree about the most suitable measure of center for a particular data set. But sometimes for certain variables the choice is clear.
Section 2: Measures of Center from Frequency Tables
Previously we discussed how to find the mean, median, and mode from a raw data set. Next, we will discuss how to find the center of the data when only the frequency distribution is given.
| Class Midpoint | Frequency |
|---|---|
| \(x_1\) | \(f(x_1)\) |
| \(x_2\) | \(f(x_2)\) |
| … | … |
| \(x_k\) | \(f(x_k)\) |
| Total: | \(n\) |
When the data is given by a frequency distribution table, the easiest measure to compute is the mode. The mode is the midpoint of the class with the largest frequency. If several classes share the largest frequency, then each of their midpoints is a mode. If all frequencies are the same, then there is no mode.
Consider the following frequency table produced by a single-value grouping and find the mode.
| Class | Frequency |
|---|---|
| 54 | 2 |
| 60 | 4 |
| 71 | 3 |
| 80 | 1 |
| Total | 10 |
Solution
The mode is 60 because 60 has the highest frequency, 4.
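Reading the mode off a frequency table can be sketched as below. The table is stored as a dictionary from class midpoint to frequency. The name `table_modes` is hypothetical, and the tie and no-mode conventions are the ones stated in the text.

```python
def table_modes(freq):
    """Midpoints with the largest frequency; [] if all frequencies tie."""
    if not freq:
        return []
    top = max(freq.values())
    if len(freq) > 1 and all(f == top for f in freq.values()):
        return []  # all frequencies are the same: no mode
    return sorted(x for x, f in freq.items() if f == top)

freq = {54: 2, 60: 4, 71: 3, 80: 1}
print(table_modes(freq))  # [60]
```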
How do we find the mean of a data set that is given by a frequency distribution? We multiply each observation (class midpoint) by its frequency, add the products together, and divide by the total number of observations \(n\):
\(\frac{x_1\cdot{f(x_1)}+x_2\cdot{f(x_2)}+\cdots+x_k\cdot{f(x_k)}}{n}\)
Consider the same frequency table again and find the mean.
| Class | Frequency |
|---|---|
| 54 | 2 |
| 60 | 4 |
| 71 | 3 |
| 80 | 1 |
| Total | 10 |
Solution
According to the formula, we compute the mean by multiplying each observation by its frequency, adding the products together, and dividing by the total. The mean in this example is
\(\frac{54\cdot2+60\cdot4+71\cdot3+80\cdot1}{2+4+3+1}=64.1\)
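The frequency-weighted formula can be sketched in a few lines. The table is again a dictionary from class midpoint to frequency, and `table_mean` is a hypothetical name.

```python
def table_mean(freq):
    """Mean from a frequency table: sum of x * f(x) divided by n."""
    n = sum(freq.values())  # total number of observations
    return sum(x * f for x, f in freq.items()) / n

freq = {54: 2, 60: 4, 71: 3, 80: 1}
result = table_mean(freq)  # (108 + 240 + 213 + 80) / 10
print(result)
```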
The least trivial measure of center to find from the frequency table is the median. To find the median, we are going to use the cumulative frequencies and the following process that depends on whether the number of observations is odd or even.
| Class Midpoint | Frequency | Cumulative Frequency |
|---|---|---|
| \(x_1\) | \(f(x_1)\) | \(cf(x_1)\) |
| \(x_2\) | \(f(x_2)\) | \(cf(x_2)\) |
| … | … | … |
| \(x_k\) | \(f(x_k)\) | \(cf(x_k)\) |
| Total: | \(n\) | |
When \(n\) is odd, the median is the midpoint of the first class for which the cumulative frequency exceeds \(\frac{n-1}{2}\). When \(n\) is even, the median is the average of two midpoints: the midpoint of the first class for which the cumulative frequency exceeds \(\frac{n}{2}-1\) and the midpoint of the first class for which the cumulative frequency exceeds \(\frac{n}{2}\). Often these two midpoints are the same.
Consider the same frequency table again, now with the cumulative frequencies, and find the median.
| Class | Frequency | Cumulative Frequency |
|---|---|---|
| 54 | 2 | 2 |
| 60 | 4 | 6 |
| 71 | 3 | 9 |
| 80 | 1 | 10 |
| Total | 10 | |
Solution
Since \(n=10\) is even, we first find the midpoint of the first class for which the cumulative frequency exceeds \(\frac{10}{2}-1=4\), which is 60. Then we find the midpoint of the first class for which the cumulative frequency exceeds \(\frac{10}{2}=5\), which is again 60. The average of the two is 60, so the median is 60.
Let’s consider the following frequency table produced by an interval grouping and find the mode, mean, and median.
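The cumulative-frequency procedure can be sketched as follows. It assumes a non-empty table stored as a dictionary from class midpoint to frequency. The helper names `first_exceeding` and `table_median` are hypothetical.

```python
from itertools import accumulate

def first_exceeding(midpoints, cum_freqs, threshold):
    """Midpoint of the first class whose cumulative frequency exceeds threshold."""
    for x, cf in zip(midpoints, cum_freqs):
        if cf > threshold:
            return x
    raise ValueError("threshold is not below the total count")

def table_median(freq):
    """Median from a frequency table, following the odd/even rule in the text."""
    midpoints = sorted(freq)
    cum_freqs = list(accumulate(freq[x] for x in midpoints))
    n = cum_freqs[-1]
    if n % 2 == 1:
        return first_exceeding(midpoints, cum_freqs, (n - 1) // 2)
    lower = first_exceeding(midpoints, cum_freqs, n // 2 - 1)
    upper = first_exceeding(midpoints, cum_freqs, n // 2)
    return (lower + upper) / 2

freq = {54: 2, 60: 4, 71: 3, 80: 1}
result = table_median(freq)  # thresholds 4 and 5 both land on 60
print(result)
```

The thresholds use integer division so that the comparison with the cumulative frequencies stays exact.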
| Class | Frequency |
|---|---|
| 40 to 45 | 2 |
| 45 to 50 | 7 |
| 50 to 55 | 13 |
| 55 to 60 | 12 |
| 60 to 65 | 7 |
| 65 to 70 | 3 |
| 70 to 75 | 1 |
| Total | 45 |
Solution
The first thing we need to do is replace each class with a class representative, in this case the class’s midpoint, and then add the cumulative frequency column:
| Class | Midpoint | Frequency | Cumulative Frequency |
|---|---|---|---|
| 40 to 45 | 42.5 | 2 | 2 |
| 45 to 50 | 47.5 | 7 | 9 |
| 50 to 55 | 52.5 | 13 | 22 |
| 55 to 60 | 57.5 | 12 | 34 |
| 60 to 65 | 62.5 | 7 | 41 |
| 65 to 70 | 67.5 | 3 | 44 |
| 70 to 75 | 72.5 | 1 | 45 |
| Total | | 45 | |
Now we are going to do the same thing as before using the class midpoints as the data entries.
- The mode is 52.5 because 52.5 is the midpoint of the class that has the highest frequency, 13.
- The mean is found by using the formula and it is equal to \(\frac{42.5\cdot2+47.5\cdot7+52.5\cdot13+57.5\cdot12+62.5\cdot7+67.5\cdot3+72.5\cdot1}{2+7+13+12+7+3+1}=55.61\)
- To find the median, we look at the cumulative frequency column. Since \(n=45\) is odd, we need the first cumulative frequency that exceeds \(\frac{45-1}{2}=22\). The value 22 itself does not exceed 22, so the first one that does is 34, and we conclude that the median is 57.5.
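The three results above can be reproduced with a short sketch that builds the midpoint and cumulative frequency columns first. The variable names are illustrative, and the median line handles only the odd case, which is what this table needs.

```python
from itertools import accumulate

# Class boundaries and frequencies from the interval-grouped table.
classes = [(40, 45), (45, 50), (50, 55), (55, 60), (60, 65), (65, 70), (70, 75)]
freqs = [2, 7, 13, 12, 7, 3, 1]

midpoints = [(lo + hi) / 2 for lo, hi in classes]
cum_freqs = list(accumulate(freqs))
n = cum_freqs[-1]

# Mode: midpoint of the class with the (single) largest frequency.
grouped_mode = midpoints[freqs.index(max(freqs))]

# Mean: frequency-weighted average of the midpoints.
grouped_mean = sum(x * f for x, f in zip(midpoints, freqs)) / n

# Median, n odd: first class whose cumulative frequency exceeds (n - 1) / 2.
grouped_median = next(
    x for x, cf in zip(midpoints, cum_freqs) if cf > (n - 1) // 2
)

print(grouped_mode, round(grouped_mean, 2), grouped_median)
```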
You may not recognize this data set, but we have already found the measures of center for the same data in its raw form:
| 42 | 43 | 46 | 47 | 47 | 47 | 48 | 49 | 49 |
| 50 | 51 | 51 | 51 | 51 | 51 | 52 | 52 | 54 |
| 54 | 54 | 54 | 54 | 55 | 55 | 55 | 55 | 56 |
| 56 | 56 | 57 | 57 | 57 | 57 | 58 | 60 | 61 |
| 61 | 61 | 62 | 64 | 64 | 65 | 68 | 69 | 70 |
Let’s compare the results obtained using the original data against the results that we just obtained from the frequency table:
| Measure of Center | Original Data | Frequency Table |
|---|---|---|
| Mean | 55.02 | 55.61 |
| Median | 55 | 57.5 |
| Mode | 51, 54 | 52.5 |
Why do the answers not match? Recall that the frequency table is created by grouping the observations into classes, a process that cannot be undone! When we use the frequency table, we essentially assume that all observations in each class are equal to the class's midpoint. As a result, the numerical summaries obtained from the frequency table are only approximations of the numerical summaries obtained from the original data set.
Now we know how to find the mean, median, and mode regardless of how the data is provided. If the formulas above do not make sense, reconstruct the entire data set from the frequency table and find the measures using their definitions! However, understanding the formulas requires nothing more than understanding what the frequency in the frequency table means.
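The reconstruction idea can be sketched as follows: repeat each class value as many times as its frequency says, then apply the definitions directly. This assumes the standard `statistics` module (Python 3.8 or later for `multimode`) and uses the single-value table from earlier in the section.

```python
from statistics import mean, median, multimode

freq = {54: 2, 60: 4, 71: 3, 80: 1}

# Rebuild the data set: each value appears as often as its frequency.
data = [x for x, f in freq.items() for _ in range(f)]

rebuilt_mean = mean(data)
rebuilt_median = median(data)
rebuilt_modes = multimode(data)

print(data)
print(rebuilt_mean, rebuilt_median, rebuilt_modes)
```

The results agree with the values obtained from the formulas (64.1, 60, and 60), which is what the frequency-based formulas are designed to guarantee.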


