4.1: Introduction to Statistics and Sampling
In all aspects of our professional and personal lives we make (or try to make) data-driven decisions. Statistics is the science of gathering data, analyzing it, and making predictions from it. Statistics provides the tools you need to react intelligently to information you hear or read. In this sense, statistics is one of the most important things that you can study.
Here are some statistical claims that we have heard many times. (We are not saying that each one of these claims is true!)
- 4 out of 5 dentists recommend Dentyne.
- Almost 85% of lung cancers in men and 45% in women are tobacco-related.
- People tend to be more persuasive when they look others directly in the eye and speak loudly and quickly.
- Women make 75 cents to every dollar a man makes when they work the same job.
- A surprising new study shows that eating egg whites can increase one's life span.
- People predict that it is very unlikely there will ever be another baseball player with a batting average over .400.
- There is an 80% chance that in a room full of 30 people at least two people will share the same birthday.
- 79.48% of all statistics are made up on the spot.
All of these claims are statistical in character. We suspect that some of them sound familiar. If not, we bet that you have heard other claims like them. Notice how diverse the examples are. They come from psychology, health, law, sports, business, etc. Indeed, data and data-interpretation show up in discourse from virtually every facet of contemporary life.
Just as important as detecting the deceptive use of statistics is the appreciation of the proper use of statistics. You must learn to recognize statistical evidence that supports a stated conclusion. When a research team is testing a new treatment for a disease, statistics allows them to conclude, based on a relatively small trial, that there is good evidence their drug is effective. Statistics allowed prosecutors in the 1950s and 1960s to demonstrate that racial bias existed in jury panels. Statistics are all around you, sometimes used well, sometimes not. We must learn how to distinguish the two cases.
Populations and Samples
Before we begin gathering and analyzing data we need to characterize the population we are studying. If we want to study the amount of money spent on textbooks by a typical first-year college student, our population might be all first-year students at your college. Or it might be
- All first-year community college students in the state of Maryland.
- All first-year students at public colleges and universities in the state of Maryland.
- All first-year students at all colleges and universities in the state of Maryland.
- All first-year students at all colleges and universities in the entire United States.
- and so on...
The population of a study is the group the collected data is intended to describe.
Why is it important to specify the population? We might get different answers to our question as we vary the population we are studying. First-year students at the University of Maryland might take slightly more diverse courses than those at PGCC, and some of these courses may require less popular textbooks that cost more. Or, on the other hand, the University of Maryland Bookstore might have a larger pool of used textbooks, reducing the cost of these books to the students. Whichever the case, the data we gather from PGCC will probably be different in nature from the data gathered at the University of Maryland. Particularly when conveying our results to others, we want to be clear about the population we are describing with our data.
A newspaper website contains a poll asking people their opinion on a recent news article. What is the population?
Solution
While the population may have been intended to be all people, the real population of the survey is readers of the website because data is gathered only from the website.
If we were able to gather data from every member of the population under study and find the average amount of money spent on textbooks by first-year students at PGCC during the 2022-2023 academic year, the resulting number would be called a parameter.
A parameter is a value (average, percentage, etc.) calculated using all the data from a population.
However, we seldom see parameters. This is because surveying an entire population is usually very time-consuming and expensive unless the population is very small or the data has already been collected for us. In those cases where data is collected from the entire population, the process is called a census.
A survey of an entire population is called a census.
You are probably familiar with two common censuses: 1) the official government Census that attempts to count the population of the U.S. every ten years, and 2) voting, which asks the opinion of all eligible voters in a district. The first of these demonstrates one additional problem with a census: the difficulty in finding and getting participation from everyone in a large population, which can bias, or skew, the results.
There are occasionally times when a census is appropriate, usually when the population is fairly small. For example, if the manager of Starbucks wanted to know the average number of hours her employees worked last week, she should be able to pull up payroll records or ask each employee directly.
Since surveying an entire population is often impractical, we usually select a sample to study.
A sample is a smaller subset of the entire population, ideally one that is fairly representative of the whole population.
We will discuss sampling methods in greater detail momentarily. For now, let us assume that samples are chosen in an appropriate manner. If we survey a sample of the entire group we want to study (say, 100 first-year students at PGCC) and find the average amount of money spent by these students on textbooks, the resulting number is called a statistic.
A statistic is a value (average, percentage, etc.) calculated using the data from a sample.
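The distinction between a parameter and a statistic can be illustrated with a short Python sketch. The spending figures below are simulated purely for illustration (they are not real data), and the population size of 2,000 students is an assumption made for this example.

```python
import random
import statistics

# Hypothetical population: textbook spending (in dollars) for 2,000 students.
# These values are simulated for illustration only.
rng = random.Random(42)
population = [round(rng.uniform(100, 900), 2) for _ in range(2000)]

# Parameter: a value computed from every member of the population (a census).
parameter = statistics.mean(population)

# Statistic: the same kind of value computed from a sample of 100 students.
sample = rng.sample(population, 100)
statistic = statistics.mean(sample)

print(f"Population mean (parameter): {parameter:.2f}")
print(f"Sample mean (statistic):     {statistic:.2f}")
```

The two numbers will usually be close but not identical. A different sample of 100 students would give a slightly different statistic, while the parameter stays fixed.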
A researcher wants to know how citizens of Tacoma feel about a voter initiative. To study this, she goes to the Tacoma Mall and randomly selects 500 shoppers to ask them their opinion. The researcher finds that 60% indicate they are supportive of the initiative. What is the sample and population? Is the 60% value a parameter or a statistic?
Solution
The sample is the 500 shoppers questioned. The population is less clear. While the intended population of this survey was Tacoma citizens, the population actually sampled was mall shoppers. There is no reason to assume that the 500 shoppers questioned would be representative of all Tacoma citizens.
The 60% value was based on the sample, so it is a statistic.
To determine the average length of trout in a lake, researchers catch 20 fish and measure them. What is the sample and population in this study?
Answer
The sample is the 20 fish caught. The population is all fish in the lake. The sample may be somewhat unrepresentative of the population since not all fish may be large enough to catch with the bait.
Sampling Methods
Now that you know that you have to take samples in order to gather data, the next question is how best to gather a sample. There are many ways to take samples. Not all of them will result in a representative sample. Also, just because a sample is large does not mean it is a good sample. As an example, you can take a sample involving one million people to find out if they feel there should be more gun control. But if you only ask members of the National Rifle Association (NRA) you may get biased results. The same is true if you only ask members of the Coalition to Stop Gun Violence. You need to make sure that you ask a cross-section of individuals. Let's look at the types of samples that can be taken. Keep in mind that no sample is perfect, and any sample may fail to represent the population.
One way to ensure that the sample has a reasonable chance of mirroring the population is to employ randomness. The most basic random method is simple random sampling.
A random sample is one in which each member of the population has an equal probability of being chosen. A simple random sample is one in which every member of the population, and every group of members of a given size, has an equal probability of being chosen.
If we could somehow identify all likely voters in the state, put each of their names on a piece of paper, toss the slips into a (very large) hat and draw 1,000 slips out of the hat, we would have a simple random sample.
In practice, computers are better suited for this sort of endeavor than millions of slips of paper and extremely large headgear. A practical way to select a random sample is to use a random number generator, such as the ones found in Excel or on a TI graphing calculator.
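As a minimal sketch of the "names in a hat" idea, Python's standard `random.sample` function draws a given number of distinct items so that every group of that size is equally likely. The voter list below is hypothetical (ID strings stand in for names), and the helper name `simple_random_sample` is introduced only for this illustration.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Return k distinct members; every k-member group is equally likely."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(population, k)

# Hypothetical list of likely voters; ID strings stand in for names.
voters = [f"voter_{i:05d}" for i in range(50_000)]

# Draw 1,000 "slips" out of the hat.
sample = simple_random_sample(voters, 1000, seed=1)
print(sample[:5])
```

Because the draw is made without replacement, no voter can appear in the sample twice.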
Another sampling technique, which helps ensure that the various groups in a population are represented in the sample, is known as stratified sampling. Stratified sampling is done by breaking the population into groups of individuals sharing some common trait (like gender, ethnicity, political affiliation, grade level, etc.). Then some individuals are randomly selected from each of these groups to form the stratified sample.
In stratified sampling, a population is divided into a number of subgroups (called strata). Random samples are then taken from each subgroup with sample sizes proportional to the size of the subgroup in the population.
In a particular state, previous data indicated that the electorate was composed of 39% Democrats, 37% Republicans and 24% independents. In a sample of 1,000 people, pollsters would then expect to get about 390 Democrats, 370 Republicans and 240 independents.
To accomplish this, they could randomly select 390 people from among those voters known to be Democrats, 370 from those known to be Republicans, and 240 from those with no party affiliation.
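The proportional allocation in this example can be sketched in Python as follows. The voter rolls are hypothetical, sized to match the 39%, 37%, and 24% split for an assumed electorate of 100,000, and `stratified_sample` is an illustrative helper rather than a standard library function.

```python
import random

def stratified_sample(strata, total, seed=None):
    """Draw a simple random sample from each stratum, sized in proportion
    to that stratum's share of the whole population."""
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Proportional allocation. In general, rounding can make the
        # stratum sizes add up to slightly more or less than `total`.
        k = round(total * len(members) / population_size)
        sample[name] = rng.sample(members, k)
    return sample

# Hypothetical voter rolls for an electorate of 100,000.
strata = {
    "Democrat": [f"D{i}" for i in range(39_000)],
    "Republican": [f"R{i}" for i in range(37_000)],
    "Independent": [f"I{i}" for i in range(24_000)],
}

sample = stratified_sample(strata, 1000, seed=2)
for party, members in sample.items():
    print(party, len(members))
```

Each party ends up in the sample in the same proportion as in the electorate: 390 Democrats, 370 Republicans, and 240 independents.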
Another sampling method is cluster sampling. Again, the population is divided into groups, but this time one or more whole groups are randomly selected to be in the sample.
In cluster sampling, the population is divided into subgroups (called clusters), and a set of subgroups is selected to be in the sample.
Cluster sampling is often done by selecting a neighborhood, block, or street at random from within a town or city. It is also used at large public gatherings or rallies. This is how the National Park Police determines the number of people who show up for events on the National Mall. They may take a picture of a small, representative area of the crowd, count the individuals in just that area, and then use that count to estimate the total crowd in attendance.
Cluster sampling is very useful in geographic studies, such as surveying the opinions of people in a state or measuring the diameter at breast height of trees in a national forest. In both situations, a cluster sample reduces the travel distances that a simple random sample would require. For example, suppose that the Gallup Poll needs to perform a public opinion poll of all registered voters in Colorado. In order to select a good sample using simple random sampling, the pollsters would have to have the names of all the registered voters in Colorado and then randomly select a subset of these names. This may be very difficult to do. So, they will use a cluster sample instead. They start by dividing the state of Colorado into geographic groups and randomly select some of these groups. They then ask all registered voters in each of the chosen groups. This makes the job of the pollsters much easier because they will not have to travel over every inch of the state to get their sample. But it is still a random sample.
If the college wanted to survey students, since students are already divided into classes, they could randomly select 10 classes and give the survey to all the students in those classes. This would be cluster sampling.
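The class-based example can be sketched in Python. The class rosters here are hypothetical (60 classes of 25 students each is an assumption for the illustration), and `cluster_sample` is a helper name introduced only for this sketch.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters; every member of a chosen cluster
    is included in the sample."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

# Hypothetical college: 60 classes with 25 students each.
classes = {
    f"class_{c:02d}": [f"student_{c:02d}_{s:02d}" for s in range(25)]
    for c in range(60)
}

# Randomly select 10 classes and survey all the students in them.
surveyed = cluster_sample(classes, 10, seed=3)
print(sorted(surveyed))
print(sum(len(roster) for roster in surveyed.values()), "students surveyed")
```

Note the contrast with stratified sampling: here the randomness is in which groups are chosen, and no sampling happens inside a chosen group.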
Another sampling method is systematic sampling. This method is often used when there is organizational order in the population. After choosing a starting individual at random, individuals for the sample are selected using a pattern. If you have ever chosen teams or groups by counting off by threes or fours, you were engaged in systematic sampling.
In systematic sampling, every \(n^{th}\) member of the population is selected to be in the sample.
To select a sample using systematic sampling, a pollster calls every \(100^{th}\) name in the phone book.
Systematic sampling is not as random as a simple random sample. In the previous example, if your name is Albert Aardvark and your sister Alexis Aardvark is right after you in the phone book, there is no way you could both end up in the sample. But, systematic sampling can yield acceptable samples.
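A minimal Python sketch of systematic sampling with a random starting point is shown below. The phone book is a hypothetical list of 10,000 placeholder names, and `systematic_sample` is an illustrative helper name.

```python
import random

def systematic_sample(population, step, seed=None):
    """Pick a random start among the first `step` members, then take
    every `step`-th member after it."""
    rng = random.Random(seed)
    start = rng.randrange(step)  # random index from 0 to step - 1
    return population[start::step]

# Hypothetical phone book with 10,000 entries in alphabetical order.
phone_book = [f"name_{i:05d}" for i in range(10_000)]

called = systematic_sample(phone_book, 100, seed=7)
print(len(called))
print(called[:3])
```

Selected entries are always exactly 100 positions apart, which is why two adjacent names in the phone book can never both end up in the sample.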
Perhaps the worst types of sampling methods are convenience samples and voluntary response samples.
Convenience sampling is done by collecting data from whoever is convenient to reach.
Voluntary response sampling allows individuals to choose themselves for the sample.
A pollster stands on a street corner and interviews the first 100 people who agree to speak to him. This is a convenience sample because the pollster gathers data from those who are easy to reach.
A website has a survey asking readers to give their opinion on a tax proposal. This is a self-selected sample, or voluntary response sample, in which respondents volunteer to participate.
Convenience samples should be avoided because they do not use randomization. Individuals that are convenient to reach generally are alike in some way and may not be representative of the entire population. Likewise, voluntary response samples do not use randomization either. Usually voluntary response samples are skewed towards people who have a particularly strong opinion about the subject of the survey or who just have way too much time on their hands and enjoy taking surveys.
Determine if the sample type is simple random sample, stratified sample, systematic sample, cluster sample, or convenience sample.
- A researcher wants to determine the different species of trees that are in the Coconino National Forest. She divides the forest using a grid system. She then randomly picks 20 different sections and records the species of every tree in each of the chosen sections.
Solution: This is a cluster sample since she randomly selected some of the groups and all individuals in the chosen groups were surveyed.
- A pollster stands in front of an organic foods grocery store and asks people leaving the store how concerned they are about pesticides in their food.
Solution: This is a convenience sample since the person is just standing out in front of one store. Most likely the people leaving an organic food grocery store are concerned about pesticides in their food, so the sample would be biased.
- The Pew Research Center wants to determine the education level of mothers. They randomly ask mothers to say if they had some high school, graduated high school, some college, graduated from college, or have an advanced degree.
Solution: This is a simple random sample, since the individuals were picked randomly.
- Penn State wants to determine the salaries of their graduates in the majors of agricultural sciences, business, engineering, and education. They randomly ask 50 graduates of agricultural sciences, 100 graduates of business, 200 graduates of engineering, and 75 graduates of education what their salaries are.
Solution: This is a stratified sample since all groups were used and then random samples were taken inside each group.
- In order for the Ford Motor Company to ensure quality of their cars, they test every 130th car coming off the assembly line of their Ohio Assembly Plant in Avon Lake, OH.
Solution: This is a systematic sample since the sample was determined using a numerical pattern: every 130th car was chosen.
In each case, indicate which sampling method was used.
- Homework was collected from every 4th person who entered the classroom.
- A sample was selected to contain 25 men and 35 women.
- Viewers of a new show are invited to vote for their favorite character on the show’s website.
- A website randomly selects 50 of their customers for a satisfaction survey.
- To survey voters in a town, a polling company randomly selects 10 city blocks, and interviews everyone who lives on those blocks.
Answer
a. Systematic b. Stratified c. Voluntary response d. Simple random e. Cluster