4.6: Measures of the Location of the Data and Outliers
- Recognize, describe, and calculate the measures of position.
- Identify an outlier.
In this section, we introduce the measures of relative standing, the purpose of which is to tell us how each individual value compares against the other values within the same data set. The easiest way to measure the relative standing of each observation is to rank them from the smallest to the largest. However, this is easier said than done for large data sets, so we will focus on the other two ways to measure the relative standing for each observation: the z-score and the percentile.
Z-Scores
Consider a population with known mean \(\mu\) and known standard deviation \(\sigma\). The z-score of an observation \(x\) is:
\[z = \dfrac{x - \mu}{\sigma} \label{zscore}\]
The term standard score is often used instead of z-score. A negative z-score indicates that the observation is below (less than) the mean, whereas a positive z-score indicates that the observation is above (greater than) the mean.
The z-score of an observation, therefore, can be used as a rough measure of its relative standing among all the observations comprising a data set. For instance, a z-score of 3 or more indicates that the observation is larger than most of the other observations; a z-score of −3 or less indicates that the observation is smaller than most of the other observations; a z-score near 0 indicates that the observation is located near the mean.
If the mean of a normal distribution is five and the standard deviation is two, find the z-score of the value 11.
Solution
The calculation is as follows:
\(z = \dfrac{11-5}{2} = 3\)
What is the \(z\)-score of \(x = 1\) in the data set with mean 12 and standard deviation 3?
- Answer
-
\(z = \dfrac{1-12}{3} \approx -3.67\)
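The z-score formula can be sketched in a few lines of Python. The function name `z_score` and its parameter names are illustrative choices, not part of the text; the two calls reproduce the worked example and the exercise above.

```python
def z_score(x, mean, std_dev):
    """Return the z-score of x, given a mean and a positive standard deviation."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / std_dev


# Value 11 with mean 5 and standard deviation 2
print(z_score(11, 5, 2))            # 3.0
# Value 1 with mean 12 and standard deviation 3
print(round(z_score(1, 12, 3), 2))  # -3.67
```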
If two distributions have the same shape or, more generally, if they differ only by center and variation, then z-scores can be used to compare the relative standings of two observations from those distributions.
Two swimmers, Angie and Beth, from different teams, wanted to find out who had the fastest time for the 50 meter freestyle when compared to her team. Which swimmer had the fastest time when compared to her team?
| Swimmer | Time (seconds) | Team Mean Time | Team Standard Deviation |
|---|---|---|---|
| Angie | 26.2 | 27.2 | 0.8 |
| Beth | 27.3 | 30.1 | 1.4 |
Solution
For Angie:
\[z = \dfrac{26.2-27.2}{0.8} = -1.25 \nonumber\]
For Beth:
\[z = \dfrac{27.3-30.1}{1.4} = -2 \nonumber\]
Since Beth's z-score is smaller than Angie's z-score, we conclude that Beth has a relatively faster time when compared to her team.
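The same comparison can be sketched in Python. The dictionary layout and the names `swimmers`, `z_scores`, and `fastest` are assumptions made for this illustration; the numbers come from the table above.

```python
def z_score(x, mean, std_dev):
    """Return how many standard deviations x lies from the mean."""
    return (x - mean) / std_dev


# (time, team mean time, team standard deviation), all in seconds
swimmers = {
    "Angie": (26.2, 27.2, 0.8),
    "Beth": (27.3, 30.1, 1.4),
}

z_scores = {name: z_score(*stats) for name, stats in swimmers.items()}

# For race times, lower is better, so the smallest z-score is the
# relatively fastest swim.
fastest = min(z_scores, key=z_scores.get)

for name, z in z_scores.items():
    print(f"{name}: z = {z:.2f}")
print("Relatively fastest:", fastest)
```

For a variable where higher values are better, such as a batting average, `max` would replace `min`.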
Two baseball players, Fredo and Karl, on different teams wanted to find out who had the higher batting average when compared to his team. Which baseball player had the higher batting average when compared to his team?
| Baseball Player | Batting Average | Team Batting Average | Team Standard Deviation |
|---|---|---|---|
| Fredo | 0.158 | 0.166 | 0.012 |
| Karl | 0.177 | 0.189 | 0.015 |
- Answer
-
For Fredo:
\(z\) = \(\dfrac{0.158-0.166}{0.012}\) = –0.67
For Karl:
\(z\) = \(\dfrac{0.177-0.189}{0.015}\) = –0.8
Fredo’s z-score of –0.67 is higher than Karl’s z-score of –0.8. For batting average, higher values are better, so Fredo has a better batting average compared to his team.
Recall that if a variable has a bell-shaped distribution with mean μ and standard deviation σ, then the Empirical Rule says the following:
- About 68% of the values lie between –1σ and +1σ of the mean μ (within one standard deviation of the mean).
- About 95% of the values lie between –2σ and +2σ of the mean μ (within two standard deviations of the mean).
- About 99.7% of the values lie between –3σ and +3σ of the mean μ (within three standard deviations of the mean).
Also, note that
- The z-scores for +1σ and –1σ are +1 and –1, respectively.
- The z-scores for +2σ and –2σ are +2 and –2, respectively.
- The z-scores for +3σ and –3σ are +3 and –3 respectively.
So that the Empirical Rule says the following:
- About 68% of the values have z-scores between –1 and +1.
- About 95% of the values have z-scores between –2 and +2.
- About 99.7% of the values have z-scores between –3 and +3.
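The three percentages can be checked numerically, assuming the distribution is exactly normal rather than merely bell-shaped. The sketch below relies on the standard identity that the proportion of a normal distribution with z-scores between \(-k\) and \(+k\) equals \(\operatorname{erf}(k/\sqrt{2})\); this identity and the helper name `share_within` are not part of the text.

```python
import math


def share_within(k):
    """Proportion of a normal distribution with z-scores between -k and +k."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"z-scores between -{k} and +{k}: {share_within(k):.4f}")
```

The printed proportions are approximately 0.6827, 0.9545, and 0.9973, which the Empirical Rule rounds to 68%, 95%, and 99.7%.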
An outlier is an observation that is significantly different from the other data points. These special data points may be errors or some kind of abnormality or they may be a key to understanding the data.
Since, by the Empirical Rule, nearly all of the observations in a bell-shaped data set have z-scores between -3 and 3, any observation with a z-score less than -3 or greater than 3 is considered an outlier because it is very unlikely.
The mean height of 15 to 18-year-old males from Chile from 2009 to 2010 was 170 cm with a standard deviation of 6 cm. Suppose a male from Chile was 158 cm tall from 2009 to 2010. Is he an outlier?
Solution
The z-score of 158 is
\[z = \dfrac{158-170}{6} = \dfrac{-12}{6} = -2 \nonumber\]
Since -2 is between -3 and 3, the male is not an outlier.
Jerome averages 18 points a game with a standard deviation of 3 points. Suppose Jerome scores 28 points in one game. Is this result an outlier?
- Answer
-
The z-score of 28 is
\[z = \dfrac{28-18}{3} = \dfrac{10}{3} \approx 3.33 \nonumber\]
Since 3.33 is greater than 3, this result is an outlier.
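The z-score outlier rule can be written as a short check. This is a minimal sketch; the function name `is_z_outlier` and the adjustable `threshold` parameter are assumptions for illustration, with the default of 3 taken from the rule above.

```python
def is_z_outlier(x, mean, std_dev, threshold=3):
    """Return True when the z-score of x is below -threshold or above +threshold."""
    z = (x - mean) / std_dev
    return abs(z) > threshold


# Height example: 158 cm, mean 170 cm, standard deviation 6 cm
print(is_z_outlier(158, 170, 6))  # False, since z = -2
# Jerome's 28-point game: mean 18, standard deviation 3
print(is_z_outlier(28, 18, 3))    # True, since z is about 3.33
```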
Percentiles
Another way to measure relative standing is to compute percentiles. The p-th percentile is a value that separates the bottom p% of the data from the top (100-p)% of the data. In other words, the percentile indicates the relative standing of an observation when data are sorted from smallest to largest. For example, 15% of data values are less than or equal to the 15th percentile.
- Low percentiles always correspond to lower data values.
- High percentiles always correspond to higher data values.
A percentile may or may not correspond to a value judgment about whether it is "good" or "bad." The interpretation of whether a certain percentile is "good" or "bad" depends on the context of the situation to which the data applies. In some situations, a low percentile would be considered "good;" in other contexts a high percentile might be considered "good". In many situations, there is no value judgment that applies. Understanding how to interpret percentiles properly is important not only when describing data, but also when calculating probabilities in later chapters of this text.
When writing the interpretation of a percentile in the context of the given data, the sentence should contain the following information.
- information about the context of the situation being considered
- the data value (value of the variable) that represents the percentile
- the percent of individuals or items with data values below the percentile
- the percent of individuals or items with data values above the percentile.
Also, note that scoring in the 90th percentile of an exam does not necessarily mean that you received 90% on the test. It means that 90% of test scores are the same as or less than your score and 10% of the test scores are the same as or greater than your score.
At a community college, it was found that the 30th percentile of credit units that students are enrolled for is seven units. Interpret the 30th percentile in the context of this situation.
Solution
- Thirty percent of students are enrolled in seven or fewer credit units.
- Seventy percent of students are enrolled in seven or more credit units.
- In this example, there is no "good" or "bad" value judgment associated with a higher or lower percentile. Students attend community college for varied reasons and needs, and their course load varies according to their needs.
On a 60-point written assignment, the 80th percentile for the number of points earned was 49. Interpret the 80th percentile in the context of this situation.
- Answer
-
Eighty percent of students earned 49 points or fewer. Twenty percent of students earned 49 or more points. A higher percentile is good because getting more points on an assignment is desirable.
To find the percentile of a particular observation, we use the following procedure:
- Order the data from smallest to largest.
- \(x =\) the number of data values counting from the bottom of the data list up to but not including the data value for which you want to find the percentile.
- \(n =\) the total number of data.
- Calculate \(\dfrac{x}{n}(100)\).
- Round to the nearest integer.
It is also common to define \(x\) as the number of data values counting from the bottom of the data list up to and including the data value. This may slightly change the result of the computation. However, percentiles are mostly used with very large populations. Thus, if you were to say that 90% of the test scores are less (and not the same or less) than your score, it would be acceptable because removing one particular data value is not significant.
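The procedure above can be sketched as a small function. The name `percentile_of` is hypothetical, and the sketch uses the "up to but not including" definition of \(x\). Note that Python's built-in `round` sends exact halves to the nearest even integer, which may differ from schoolbook rounding in rare ties.

```python
def percentile_of(value, data):
    """Percentile of value: the percent of data strictly below it, rounded."""
    ordered = sorted(data)                    # order smallest to largest
    x = sum(1 for v in ordered if v < value)  # values below, not including value
    n = len(ordered)                          # total number of data
    return round(100 * x / n)


# Illustrative data: the exam scores of ten students
scores = [55, 61, 64, 70, 72, 75, 78, 83, 88, 94]

print(percentile_of(83, scores))  # 7 of 10 scores are below 83, so 70
print(percentile_of(61, scores))  # 1 of 10 scores is below 61, so 10
```

The `scores` list is made up for this illustration and does not come from the text.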
Listed are 29 ages for Academy Award winning best actors in order from smallest to largest.
18; 21; 22; 25; 26; 27; 29; 30; 31; 33; 36; 37; 41; 42; 47; 52; 55; 57; 58; 62; 64; 67; 69; 71; 72; 73; 74; 76; 77
- Find the percentile for 58.
- Find the percentile for 25.
Solution
-
Counting from the bottom of the list, there are 18 data values less than 58.
\(x = 18\) and \(n = 29\), thus \(\dfrac{x}{n}(100) = \dfrac{18}{29}(100) \approx 62.07\). 58 is the 62nd percentile.
-
Counting from the bottom of the list, there are three data values less than 25.
\(x = 3\) and \(n = 29\). \(\dfrac{x}{n}(100) = \dfrac{3}{29}(100) \approx 10.34\). 25 is the 10th percentile.
Listed are 30 ages for Academy Award winning best actors in order from smallest to largest.
18; 21; 22; 25; 26; 27; 29; 30; 31; 31; 33; 36; 37; 41; 42; 47; 52; 55; 57; 58; 62; 64; 67; 69; 71; 72; 73; 74; 76; 77
Find the percentiles for 47 and 31.
- Answer
-
Percentile for 47: Counting from the bottom of the list, there are 15 data values less than 47.
\(x = 15\) and \(n = 30\). \(\dfrac{x}{n}(100) = \dfrac{15}{30}(100) = 50\). 47 is the 50th percentile.
Percentile for 31: Counting from the bottom of the list, there are eight data values less than 31.
\(x = 8\) and \(n = 30\). \(\dfrac{x}{n}(100) = \dfrac{8}{30}(100) \approx 26.67\). 31 is the 27th percentile.
Percentiles are useful for comparing values. For this reason, universities and colleges use percentiles extensively. One instance in which colleges and universities use percentiles is when SAT results are used to determine a minimum testing score that will be used as an acceptance factor. For example, suppose Duke accepts SAT scores at or above the 75th percentile. That translates into a score of at least 1220.
If you were to do a little research, you would find several formulas for calculating the k-th percentile. Here is one of them.
- \(k =\) the percentile to be found (the k-th percentile). Its value may or may not be part of the data.
- \(i =\) the index (ranking or position of a data value)
- \(n =\) the total number of data
- Order the data from smallest to largest.
- Calculate \(i = \dfrac{k}{100}(n + 1)\)
- If \(i\) is an integer, then the \(k^{th}\) percentile is the data value in the \(i^{th}\) position in the ordered set of data. If \(i\) is not an integer, then round \(i\) up and round \(i\) down to the nearest integers. The \(k^{th}\) percentile is the average of the two data values in these two positions in the ordered data set.
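The index formula \(i = \dfrac{k}{100}(n + 1)\) can be sketched in Python as follows. The function name `kth_percentile` is hypothetical, and the sketch assumes that \(k\) is a whole number so the index can be computed with exact integer arithmetic. It raises an error when the index falls outside positions 1 through \(n\), a case the formula above does not address.

```python
def kth_percentile(data, k):
    """k-th percentile from the index i = k/100 * (n + 1), for whole-number k."""
    ordered = sorted(data)
    n = len(ordered)
    # i = whole + remainder/100, computed without floating-point error
    whole, remainder = divmod(k * (n + 1), 100)
    if whole < 1 or whole > n or (remainder and whole == n):
        raise ValueError("index falls outside the data for this k and n")
    if remainder == 0:
        # i is an integer: take the value in position i (1-based)
        return ordered[whole - 1]
    # i is not an integer: average the values in the positions below and above
    return (ordered[whole - 1] + ordered[whole]) / 2


# Illustrative data: nine values, so n + 1 = 10
values = [3, 5, 8, 10, 12, 15, 18, 20, 24]

print(kth_percentile(values, 50))  # i = 5, the fifth value: 12
print(kth_percentile(values, 25))  # i = 2.5, the average of 5 and 8: 6.5
```

The `values` list is made up for this illustration. Other percentile formulas, including those built into statistical software, may give slightly different results.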
Listed are 29 ages for Academy Award winning best actors in order from smallest to largest.
18; 21; 22; 25; 26; 27; 29; 30; 31; 33; 36; 37; 41; 42; 47; 52; 55; 57; 58; 62; 64; 67; 69; 71; 72; 73; 74; 76; 77
- Find the 70th percentile.
- Find the 83rd percentile.
Solution
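-
- \(k = 70\)
- \(i\) = the index
- \(n = 29\)
\(i = \dfrac{k}{100}(n + 1) = \dfrac{70}{100}(29 + 1) = 21\). Twenty-one is an integer, and the data value in the 21st position in the ordered data set is 64. The 70th percentile is 64 years.
-
- \(k = 83\)
- \(i\) = the index
- \(n = 29\)
\(i = \dfrac{k}{100}(n + 1) = \dfrac{83}{100}(29 + 1) = 24.9\), which is not an integer. Round down to 24 and up to 25. The age in the 24th position is 71 and the age in the 25th position is 72. The average of 71 and 72 is 71.5. The 83rd percentile is 71.5 years.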
Listed are 29 ages for Academy Award winning best actors in order from smallest to largest.
18; 21; 22; 25; 26; 27; 29; 30; 31; 33; 36; 37; 41; 42; 47; 52; 55; 57; 58; 62; 64; 67; 69; 71; 72; 73; 74; 76; 77
Calculate the 20th percentile and the 55th percentile.
Answer
\(k = 20\). Index \(= i = \dfrac{k}{100}(n+1) = \dfrac{20}{100}(29 + 1) = 6\). The age in the sixth position is 27. The 20th percentile is 27 years.
\(k = 55\). Index \(= i = \dfrac{k}{100}(n+1) = \dfrac{55}{100}(29 + 1) = 16.5\). Round down to 16 and up to 17. The age in the 16th position is 52 and the age in the 17th position is 55. The average of 52 and 55 is 53.5. The 55th percentile is 53.5 years.
Quartiles are special percentiles. The first quartile, \(Q_1\), is the same as the 25th percentile, and the third quartile, \(Q_3\), is the same as the 75th percentile. The second quartile, \(Q_2\), is also known as the median, \(M\), and is the same as the 50th percentile. We can also extend the definition, and define the zeroth quartile, \(Q_0\), as the 0th percentile (the smallest value), and the fourth quartile, \(Q_4\), as the 100th percentile (the largest value).
On a timed math test, the first quartile for time it took to finish the exam was 35 minutes. Interpret the first quartile in the context of this situation.
Solution
- Twenty-five percent of students finished the exam in 35 minutes or less.
- Seventy-five percent of students finished the exam in 35 minutes or more.
- A low percentile could be considered good, as finishing more quickly on a timed exam is desirable. (If you take too long, you might not be able to finish.)
For the 100-meter dash, the third quartile for times for finishing the race was 11.5 seconds. Interpret the third quartile in the context of the situation.
- Answer
-
Twenty-five percent of runners finished the race in 11.5 seconds or more. Seventy-five percent of runners finished the race in 11.5 seconds or less. A lower percentile is good because finishing a race more quickly is desirable.
To calculate quartiles, the data must be ordered from smallest to largest. The median divides ordered data into halves; that is, half the values are the same as or smaller than the median, and half the values are the same as or larger. Quartiles divide ordered data into quarters; that is, \(Q_1\) divides the bottom half into two halves and \(Q_3\) divides the top half into two halves.
Consider the following data.
1; 11.5; 6; 7.2; 4; 8; 9; 10; 6.8; 8.3; 2; 2; 10; 1
Ordered from smallest to largest:
1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5
Since there are 14 observations, the median is between the seventh value, 6.8, and the eighth value, 7.2. To find the median, add the two values together and divide by two:
\[\dfrac{6.8 + 7.2}{2} = 7 \nonumber\]
The median is seven. Half of the values are smaller than seven and half of the values are larger than seven.
Quartiles are numbers that separate the data into quarters. Quartiles may or may not be part of the data. To find the quartiles, first find the median or second quartile. The first quartile, \(Q_1\), is the middle value, or median, of the lower half of the data, and the third quartile, \(Q_3\), is the middle value, or median, of the upper half of the data. To get the idea, consider the same data set:
1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5
The median or second quartile is seven. The lower half of the data are 1, 1, 2, 2, 4, 6, 6.8. The middle value of the lower half is two.
1; 1; 2; 2; 4; 6; 6.8
The number two, which is part of the data, is the first quartile. One-fourth of the entire set of values are the same as or less than two and three-fourths of the values are more than two.
The upper half of the data is 7.2, 8, 8.3, 9, 10, 10, 11.5. The middle value of the upper half is nine.
The third quartile, \(Q_3\), is nine. Three-fourths (75%) of the ordered data set is less than nine. One-fourth (25%) of the ordered data set is greater than nine. The third quartile is part of the data set in this example.
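The median-of-halves method can be sketched in Python with the standard library's `statistics.median`. The function name `quartiles` is hypothetical. For an odd number of observations, the sketch assumes that the median itself is left out of both halves; other conventions exist and can give slightly different quartiles.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, median, Q3) using the median of each half of the sorted data."""
    ordered = sorted(data)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]
    # With an odd count, skip the middle value so it belongs to neither half
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


values = [1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1]
q1, q2, q3 = quartiles(values)
print(q1, q2, q3)  # Q1 = 2, median = 7, Q3 = 9
```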
Sharpe Middle School is applying for a grant that will be used to add fitness equipment to the gym. The principal surveyed 15 anonymous students to determine how many minutes a day the students spend exercising. The results from the 15 anonymous students are shown.
0 minutes; 40 minutes; 60 minutes; 30 minutes; 60 minutes; 10 minutes; 45 minutes; 30 minutes; 300 minutes; 90 minutes; 30 minutes; 120 minutes; 60 minutes; 0 minutes; 20 minutes
Determine the five-number summary (minimum, first quartile, median, third quartile, and maximum).
- Answer
-
Min = 0; \(Q_1\) = 20; Med = 40; \(Q_3\) = 60; Max = 300
The interquartile range is a number that indicates the spread of the middle half or the middle 50% of the data. It is the difference between the third quartile (\(Q_3\)) and the first quartile (\(Q_1\)).
\[IQR = Q_3 - Q_1 \tag{2.4.1}\]
The IQR can help to determine potential outliers. A value is suspected to be an outlier if it lies more than 1.5 × IQR below the first quartile or more than 1.5 × IQR above the third quartile, that is, if it is less than \(Q_1 - (1.5)(IQR)\) or greater than \(Q_3 + (1.5)(IQR)\). Potential outliers always require further investigation.
For the following 13 real estate prices, calculate the IQR and determine if any prices are potential outliers. Prices are in dollars.
389,950; 230,500; 158,000; 479,000; 639,000; 114,950; 5,500,000; 387,000; 659,000; 529,000; 575,000; 488,800; 1,095,000
Answer
Order the data from smallest to largest.
114,950; 158,000; 230,500; 387,000; 389,950; 479,000; 488,800; 529,000; 575,000; 639,000; 659,000; 1,095,000; 5,500,000
\[M = 488,800 \nonumber\]
\[Q_{1} = \dfrac{230,500 + 387,000}{2} = 308,750\nonumber\]
\[Q_{3} = \dfrac{639,000 + 659,000}{2} = 649,000\nonumber\]
\[IQR = 649,000 - 308,750 = 340,250\nonumber\]
\[(1.5)(IQR) = (1.5)(340,250) = 510,375\nonumber\]
\[Q_{1} - (1.5)(IQR) = 308,750 - 510,375 = -201,625\nonumber\]
\[Q_{3} + (1.5)(IQR) = 649,000 + 510,375 = 1,159,375\nonumber\]
No house price is less than –201,625. However, 5,500,000 is more than 1,159,375. Therefore, 5,500,000 is a potential outlier.
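The IQR check above can be sketched in Python. The function names `iqr_fences` and `potential_outliers` are hypothetical. Quartiles are computed as medians of the lower and upper halves, leaving the median out of both halves when the number of observations is odd, as in the worked solution.

```python
from statistics import median


def iqr_fences(data):
    """Return (Q1 - 1.5*IQR, Q3 + 1.5*IQR) using median-of-halves quartiles."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    # With an odd count, the median belongs to neither half
    q3 = median(ordered[half + 1:] if n % 2 else ordered[half:])
    step = 1.5 * (q3 - q1)
    return q1 - step, q3 + step


def potential_outliers(data):
    """Return the values lying below the lower fence or above the upper fence."""
    low, high = iqr_fences(data)
    return [v for v in data if v < low or v > high]


prices = [389950, 230500, 158000, 479000, 639000, 114950, 5500000,
          387000, 659000, 529000, 575000, 488800, 1095000]

print(iqr_fences(prices))          # lower fence -201,625; upper fence 1,159,375
print(potential_outliers(prices))  # only 5,500,000 is flagged
```

A flagged value is only a candidate: as noted above, potential outliers always require further investigation.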
For the following 11 salaries, calculate the IQR and determine if any salaries are outliers. The salaries are in dollars.
$33,000; $64,500; $28,000; $54,000; $72,000; $68,500; $69,000; $42,000; $54,000; $120,000; $40,500
Answer
Order the data from smallest to largest.
$28,000; $33,000; $40,500; $42,000; $54,000; $54,000; $64,500; $68,500; $69,000; $72,000; $120,000
Median = $54,000
\[Q_{1} = \$40,500\nonumber\]
\[Q_{3} = \$69,000\nonumber\]
\[IQR = \$69,000 - \$40,500 = \$28,500\nonumber\]
\[(1.5)(IQR) = (1.5)(\$28,500) = \$42,750\nonumber\]
\[Q_{1} - (1.5)(IQR) = \$40,500 - \$42,750 = -\$2,250\nonumber\]
\[Q_{3} + (1.5)(IQR) = \$69,000 + \$42,750 = \$111,750\nonumber\]
No salary is less than –$2,250. However, $120,000 is more than $111,750, so $120,000 is a potential outlier.
Which one is better? The use of z-scores as a measure of relative standing can be refined and made more precise by applying Chebyshev’s rule, but percentiles usually give a more accurate and meaningful measure of relative standing than z-scores. However, computing a percentile requires access to the entire data set, whereas computing a z-score requires only the mean and standard deviation of the data set. When little information is available, z-scores therefore provide a feasible alternative to percentiles for measuring relative standing. For example, for a student to compute the percentile of their exam score, the instructor would have to share the entire list of scores, which is rarely practical. To compute the z-score, the instructor needs to share only the mean and the standard deviation, which instructors are usually more comfortable doing.