1.1: Introduction to Statistics


    Section 1: The Nature of Statistics

    In this section, we will discuss the nature of statistics and what one should expect to learn in an introductory statistics course.

    There are many quotes in literature that describe statistics; some are accurate, and some are simply entertaining! Here are some of my favorites:

    • "There are three kinds of lies: lies, damned lies, and statistics."  - Benjamin Disraeli
    • "There are two kinds of statistics, the kind you look up, and the kind you make up. " - Rex Stout
    • "Statistics can be used to support anything - especially statisticians." - Franklin P. Jones
    • "Some use statistics as a drunken man uses lamppost - for support rather than illumination." - Andrew Lang
    • "Figures don't lie, liars figured." - Mark Twain

    What is the subject of interest in statistics, and what should you expect to learn in an introductory statistics course? The science of statistics deals with every aspect of the collection, analysis, interpretation, and presentation of data. What exactly are data? Data (from Latin, the plural of datum) are collections of observations. Why does hardly anyone know or use the word “datum”? Because everywhere you look, you are bombarded with data! In the world we live in now, there is a great need for people who can handle data properly. Basically, anything that has to do with data is the subject of interest in a statistics course. In introductory statistics we will learn about every stage of handling data and, most importantly, develop the critical thinking skills that will help you make sense of the data that surround you. But first we introduce a few fundamental terms.

    The population is the collection of all individuals or items under consideration in a statistical study. A study that involves the entire population is called a census. Censuses were taken as long ago as Roman times; however, conducting a census may be time-consuming, costly, impractical, or even impossible. Two methods other than a census for obtaining information are sampling and experimentation. A sample is a part of the population from which information is obtained. Since the sample will be used to draw conclusions about the entire population, it should be a representative sample – a sample that reflects as closely as possible the relevant characteristics of the population under consideration.

    Did you know? To see what can happen when a sample is not representative, consider the presidential election of 1936. Before the election, the Literary Digest magazine conducted an opinion poll of the voting population. Its survey team asked a sample of the voting population whether they would vote for Franklin D. Roosevelt, the Democratic candidate, or for Alfred Landon, the Republican candidate. Based on the results of the survey, the magazine predicted an easy win for Landon. But when the actual election results were in, Roosevelt won by the greatest landslide in the history of presidential elections! What happened? The sample was obtained from among people who owned a car or had a telephone. In 1936, that group included only the more well-to-do people, and historically such people tend to vote Republican.

    Over the centuries, records of such things as births, deaths, marriages, and taxes led naturally to the development of methods for organizing and summarizing information called descriptive statistics. Major developments began to occur with the research of Karl Pearson (1857–1936) and Ronald Fisher (1890–1962), who published their findings in the early years of the twentieth century. The methods that were established for drawing and measuring the reliability of conclusions about a population based on information obtained from a sample of the population are now called inferential statistics. Historically, descriptive statistics appeared before inferential statistics. Although inferential statistics is a newer arrival, since the work of Pearson and Fisher, it has evolved rapidly and is now applied in a myriad of fields. Familiarity with statistics will help you make sense of many things you read in newspapers, magazines, and on the Internet or watch in the news.

    Example: In 1948, the Washington Senators played 153 games, winning 56 and losing 97. They finished seventh in the American League and were led in hitting by Bud Stewart, whose batting average was 0.279. Baseball statisticians compiled these and many other statistics by organizing the complete records for each game of the season. Do you think this example demonstrates the methods of descriptive or inferential statistics? The correct answer is descriptive, since we have simply presented the record of Bud Stewart and his team's performance without any attempt to make predictions.

    Example: In the fall of 1948, the Gallup Poll taken just prior to the election predicted that President Truman would win only 44.5% of the vote and be defeated by the Republican nominee, Thomas E. Dewey. Do you think this example demonstrates the methods of descriptive or inferential statistics? The correct answer is inferential, since we are trying to estimate the opinion of the entire population based on a sample. In reality, the prediction was wrong: Truman won more than 49% of the vote and, with it, the presidency. The Gallup Organization modified some of its procedures and has correctly predicted the winner many times since then.

    The process of collecting, analyzing, interpreting, and presenting data can be referred to as a statistical study. We can distinguish several stages in a statistical study:

    1. Prepare – at this stage, a researcher will decide the purpose and the methods of the statistical study.
    2. Collect the data – at this stage, a researcher will collect the data in the way decided in the previous stage.
    3. Process the data – at this stage, a researcher will summarize the data or will perform a certain statistical procedure with it.
    4. Conclude – at this stage, a researcher will interpret the results of the study based on the purpose and the set goal.

    We just discussed the nature of statistics and what one should expect to learn in an introductory statistics course. Next, we will discuss each stage of a statistical study in turn.

    Section 2: Preparing for a Study

    When preparing for a statistical study, it is important to consider the context, the purpose, and the methods. Based on the methods, we can classify a statistical study in the following way. In an observational study, researchers simply observe characteristics and take measurements, as in a sample survey. In a designed experiment, researchers impose treatments and controls first and then observe characteristics and take measurements. Note that, in an observational study, someone is observing data that already exist (i.e., the data were there and would be there regardless of whether anyone was observing them or not). In a designed experiment, however, the data do not exist until someone performs the experiment that produces them.

    Example: Several studies have been conducted to analyze the relationship between vasectomies and prostate cancer. One such study found 113 cases of prostate cancer among 22,000 men who had a vasectomy. This compares to a rate of 70 cases per 22,000 among men who didn't have a vasectomy. The study shows about a 60% elevated risk of prostate cancer for men who have had a vasectomy. Do you think this is an observational study or a designed experiment? The correct answer is an observational study, because the subjects were simply observed and the data were recorded.
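
    As a quick check of the "about 60% elevated risk" figure, here is a minimal sketch in Python (our choice of language for illustration; no code appears in the original text) that recomputes the relative increase from the quoted counts.

        # Recompute the elevated risk of prostate cancer from the counts quoted above.
        cases_vasectomy = 113      # cases among 22,000 men who had a vasectomy
        cases_no_vasectomy = 70    # cases among 22,000 men who did not
        group_size = 22_000

        rate_vasectomy = cases_vasectomy / group_size
        rate_no_vasectomy = cases_no_vasectomy / group_size

        relative_increase = rate_vasectomy / rate_no_vasectomy - 1
        print(f"Elevated risk: {relative_increase:.0%}")   # about 61%, consistent with "about 60%"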

    Example: For several years, evidence had been mounting that folic acid reduces major birth defects. In one such study, the doctors enrolled 4753 women prior to conception and divided them randomly into two groups. One group took daily multivitamins containing 0.8 mg of folic acid, whereas the other group received only trace elements. A drastic reduction in the rate of major birth defects occurred among the women who took folic acid: 13 per 1000, as compared to 23 per 1000 for those women who did not take folic acid. Do you think this is an observational study or a designed experiment? The correct answer is designed experiment because the subjects were not just simply observed but were randomly assigned a treatment to control the experiment.
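
    Likewise, a short Python sketch (again, our own illustration rather than part of the study) shows how drastic the reduction in birth defects was in relative terms.

        # Compare the two birth-defect rates quoted above.
        rate_folic_acid = 13 / 1000   # major birth defects among women who took folic acid
        rate_placebo = 23 / 1000      # among women who received only trace elements

        relative_reduction = 1 - rate_folic_acid / rate_placebo
        print(f"Relative reduction in major birth defects: {relative_reduction:.0%}")   # about 43%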

    Through an observational study one may observe an association (positive or negative) between two variables; that is, the growth of one variable can be linked to the growth or decline of the other. Do not confuse association with causation, in which a change in one variable directly triggers a change in the other. Remember that association does not imply causation! It is important to understand that observational studies can only reveal an association; to establish causation between two variables, one has to design an experiment.

    Two variables may be strongly associated because they are both related to another variable, called a lurking variable, which is a variable not previously considered that causes changes in both variables under consideration. In each of the following examples, assuming a strong association between the two variables was observed, what do you think the lurking variable is?

    • A strong association can be observed between the average number of computers per person in a country and that country's average life expectancy, but this does not mean that more computers increase life expectancy, or vice versa. Both variables can be related to the country's GDP.
    • A strong association can be observed between the number of firefighters at a fire and the damage caused by the fire, but it doesn’t mean that more firefighters cause more damage to the property. Both variables can be related to the intensity of the fire.

    If you want to learn more about the difference between association and causation, I invite you to read the history of the US stop-smoking movement. The tobacco companies fought the idea of causation for a long time, since the strong association between tobacco consumption and the risk of lung cancer was hard to dispute.

    Section 3: Collecting Data

    When collecting data, it is important to be aware of sampling errors and non-sampling errors. Since no two samples are alike, the process of sampling causes a natural variation, which is what we call the sampling error. The only way to avoid sampling error is to perform a census. A non-sampling error is an issue that affects the reliability of sampling data other than natural variation; it includes a variety of human errors such as poor study design, biased sampling methods, inaccurate information provided by study participants, data entry errors, and poor analysis. In reality, a sample will never be as good as the entire population; since sampling error is unavoidable, we always try to minimize or eliminate the non-sampling errors.

    Example: Prior to the 2016 presidential election, Bobert surveyed ten random people, of whom 4 said they would vote for Trump and 6 said they would vote for Clinton. The actual popular vote was 46.1% for Trump and 48.2% for Clinton. Do you think this was a sampling error or a non-sampling error? This is an example of sampling error. Bobert did everything right, but one should not expect a sample to produce exactly the same results as the population! The simulation sketched below shows how common such a result is.
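
    Here is a minimal simulation sketch in Python (our own illustration, not part of the original example) that draws many samples of ten voters from a population in which 48.2% support Clinton and counts how often a sample looks like Bobert's.

        import random

        random.seed(1)            # fixed seed only so the illustration is reproducible
        p_clinton = 0.482         # Clinton's share of the actual popular vote
        trials = 10_000
        six_or_more = 0

        for _ in range(trials):
            # Draw a sample of 10 voters; each supports Clinton with probability 0.482.
            clinton_votes = sum(random.random() < p_clinton for _ in range(10))
            if clinton_votes >= 6:
                six_or_more += 1

        print(f"Samples of 10 with at least 6 Clinton voters: {six_or_more / trials:.0%}")
        # Roughly a third of such samples look like Bobert's, purely from sampling error.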

    Example: Prior to the 2016 presidential election, Bobert surveyed all his Facebook friends: 7 (14%) said they would vote for Trump, 43 (86%) said they would vote for Clinton, and 37 did not respond. The actual popular vote was 46.1% for Trump and 48.2% for Clinton. Do you think this was a sampling error or a non-sampling error? This is an example of a non-sampling error. Despite the larger sample size, there is no way this sample could have produced accurate results, because of coverage bias and the many other biases introduced by the poor sampling method.

    An alternative way to produce data is to design and perform an experiment. While studying various experimental designs is not the purpose of this course, one should know the "gold standard" of experimental design: the randomized controlled experiment, so called because it is so effective. Here is a diagram showing its sequential flow:

    Figure \(\PageIndex{3.1}\): A diagram showing the sequential flow of a randomized controlled experiment called the "gold standard".

    A placebo such as a sugar pill that has no medicinal effect is sometimes used as one of the treatments. The following example describes how the gold standard was used in the largest health experiment ever conducted.

    In 1954, an experiment was designed to test the effectiveness of the Salk vaccine in preventing polio, which had killed or paralyzed thousands of children.

    Figure \(\PageIndex{3.2}\): Jonas Salk on the cover of TIME magazine, March 29, 1954.

    A total of 401,974 children were randomly assigned to two groups:

    • 200,745 children were given a treatment consisting of Salk vaccine injections;
    • 201,229 children were injected with a placebo that contained no drug.

    The assignment of each child to the treatment or placebo group was made by a random process equivalent to flipping a coin. Among the children given the Salk vaccine, 33 later developed paralytic polio; among the children given a placebo, 115 later developed paralytic polio.
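
    To make the comparison easier to read, here is a minimal Python sketch (our own illustration, not part of the historical report) that converts the raw counts above into rates per 100,000 children in each group.

        # Convert the raw polio counts into rates per 100,000 children in each group.
        vaccine_group, vaccine_cases = 200_745, 33
        placebo_group, placebo_cases = 201_229, 115

        vaccine_rate = vaccine_cases / vaccine_group * 100_000
        placebo_rate = placebo_cases / placebo_group * 100_000

        print(f"Paralytic polio per 100,000: vaccine {vaccine_rate:.1f}, placebo {placebo_rate:.1f}")
        print(f"Relative reduction: {1 - vaccine_rate / placebo_rate:.0%}")
        # About 16 vs. 57 cases per 100,000, roughly a 71% reduction in the vaccinated group.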

    Section 4: Organizing Data

    After the data have been collected, either from sampling or from a designed experiment, comes the next stage: organizing the data. Organizing data can be a difficult task involving many steps, such as checking data quality and cleaning the data. The storage medium must also be considered and chosen. Working with big data creates its own challenges, which are beyond the scope of an introductory statistics course. In this course we will focus on different ways to organize the collected data visually and numerically. What we will be able to do with the data depends on what type of data we are working with. Soon we will discuss the different types of data and what can be done with each. The goal of this step is to organize the data so that they can be analyzed further or explored for patterns and trends! This was a brief introduction to organizing data; we will dedicate more time to this topic later.

    Section 5: Drawing Conclusions

    After the data are processed and organized comes the next stage: drawing conclusions. Later, we will learn a variety of procedures, each of which will come with a specific way of drawing conclusions. In general, this is the stage where we reflect on what was done in a study and discuss the significance of the results. Statistical significance is achieved in a study when we get a result that is very unlikely to occur by chance. A common criterion is that we have statistical significance if the likelihood of an event occurring by chance is 5% or less. Results are practically significant when the difference is large enough to be meaningful in real life. What is meaningful may be subjective and may depend on the context. As the next example shows, it is possible for some treatment or finding to be effective, yet common sense might suggest that it does not make enough of a difference to justify its use or to be practically significant.

    Example: In a study of a weight loss program, 31 subjects lost an average of 2 lbs. after 9 months. Methods of statistics can be used to show that these results would be unlikely to occur by chance if the diet had no effect. Does the weight loss program have statistical significance? According to the last sentence of the report, the results are statistically significant! Does the weight loss program have practical significance? Would you spend any money on a program that promises you will lose 2 lbs. over the next 9 months? Me neither, so the results are practically insignificant!

    Did you know? ProCare Industries once supplied a product named Gender Choice that supposedly increased the chance of a couple having a baby of the gender they desired. In the absence of any evidence of its effectiveness, the product was banned by the Food and Drug Administration (FDA) as a "gross deception of the consumer." But suppose that the product was tested with 10,000 couples who wanted to have baby girls, and the results consist of 5200 baby girls born in 10,000 births. This result is statistically significant because the likelihood of it happening by chance alone is only about 0.003%. Chance does not seem like a feasible explanation. That 52% rate of girls is statistically significant, but it lacks practical significance because 52% is only slightly above 50%. Couples would not want to spend the time and money to increase the likelihood of a girl from 50% to 52%.
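
    Here is a minimal Python sketch of that significance calculation (our own illustration; it assumes the SciPy library is available), computing the chance of 5200 or more girls in 10,000 births if girls and boys were equally likely.

        from scipy.stats import binom

        # P(X >= 5200) when X ~ Binomial(n = 10,000 births, p = 0.5 chance of a girl).
        p_value = binom.sf(5199, 10000, 0.5)   # sf(k) gives P(X > k), i.e. P(X >= 5200)
        print(f"Probability of 5200 or more girls by chance alone: {p_value:.4%}")
        # About 0.003%, far below the 5% criterion, so the result is statistically significant.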

    While ethics is not the main point of the course, there are typically two ways in which statistics can be used for deception:

    • evil intent on the part of dishonest persons.
    • unintentional errors on the part of people who don't know any better.

    As responsible citizens, we should learn to distinguish between statistical conclusions that are likely to be valid and those that are seriously flawed, regardless of the source. We will look at some examples of deception after we learn how to organize and visualize data.


    1.1: Introduction to Statistics is shared under a not declared license and was authored, remixed, and/or curated by LibreTexts.
