4.3: Regression
We have already seen examples in the text where linear and quadratic functions model a wide variety of real world phenomena, ranging from production costs to the height of a projectile above the ground. In this section, we use some basic tools from statistical analysis to quantify linear and quadratic trends in real world data in order to generate linear and quadratic models. Our goal is to give the reader an understanding of the basic processes involved, but we are quick to refer the reader to a more advanced course for a complete exposition of this material.
Suppose we collected three data points: \(\{(1,2), (3,1), (4,3)\}\). By plotting these points, we can clearly see that they do not lie along the same line. If we pick any two of the points, we can find a line containing both which completely misses the third, but our aim is to find a line which is in some sense 'close' to all the points, even though it may go through none of them. The way we measure 'closeness' in this case is to find the total squared error between the data points and the line.
Consider our three data points and the line \(y = \frac{1}{2}x + \frac{1}{2}\). For each of our data points, we find the vertical distance between the point and the line. To accomplish this, we need to find a point on the line directly above or below each data point - in other words, a point on the line with the same \(x\)-coordinate as our data point. For example, to find the point on the line directly below \((1,2)\), we plug \(x=1\) into \(y = \frac{1}{2}x + \frac{1}{2}\) and we get the point \((1,1)\). Similarly, the data point \((3,1)\) corresponds to the point \((3,2)\) on the line, and \((4,3)\) corresponds to \(\left(4, \frac{5}{2}\right)\).
We find the total squared error \(E\) by taking the sum of the squares of the differences of the \(y\)-coordinates of each data point and its corresponding point on the line. For the data and line above, \(E = (2-1)^2 + (1-2)^2 + \left(3 - \frac{5}{2}\right)^2 = \frac{9}{4}\). Using advanced mathematical machinery (specifically Calculus and Linear Algebra), it is possible to find the line which results in the lowest value of \(E\). This line is called the least squares regression line, or sometimes the 'line of best fit'. The formula for the line of best fit requires notation we won't present until Chapter 9, so we will revisit it then. The graphing calculator can come to our assistance here, since it has a built-in feature to compute the regression line. We enter the data and run the Linear Regression feature.
The calculator tells us that the line of best fit is \(y = ax + b\) where the slope is \(a \approx 0.214\) and the \(y\)-coordinate of the \(y\)-intercept is \(b \approx 1.429\). (We will stick to using three decimal places for our approximations.) Using this line, we compute the total squared error for our data to be \(E \approx 1.786\). The value \(r\) is the correlation coefficient and is a measure of how close the data is to being on the same line. The closer \(|r|\) is to 1, the better the linear fit. Since \(r \approx 0.327\), this tells us that the line of best fit doesn't fit all that well - in other words, our data points aren't close to being linear. The value \(r^2\) is called the coefficient of determination and is also a measure of the goodness of fit. (We refer the interested reader to a course in Statistics to explore the significance of \(r\) and \(r^2\).) Plotting the data together with its regression line confirms this visually.
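For readers without a graphing calculator, the numbers above can be checked with a short script. The following is a minimal sketch that uses the standard closed-form expressions for the least squares slope, intercept and correlation coefficient, which is an assumption on our part since the text defers the formula to a later chapter. The function names `total_squared_error` and `least_squares_line` are illustrative only.

```python
from math import sqrt

# The three data points from the text.
points = [(1, 2), (3, 1), (4, 3)]

def total_squared_error(points, a, b):
    """Sum of squared vertical distances from each point to y = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in points)

def least_squares_line(points):
    """Return slope a, intercept b and correlation coefficient r.

    Uses the standard closed-form formulas (assumed here, not derived
    in the text).
    """
    n = len(points)
    x_mean = sum(x for x, _ in points) / n
    y_mean = sum(y for _, y in points) / n
    sxx = sum((x - x_mean) ** 2 for x, _ in points)
    syy = sum((y - y_mean) ** 2 for _, y in points)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in points)
    a = sxy / sxx
    b = y_mean - a * x_mean
    r = sxy / sqrt(sxx * syy)
    return a, b, r

trial_error = total_squared_error(points, 0.5, 0.5)  # the line y = x/2 + 1/2
a, b, r = least_squares_line(points)
best_error = total_squared_error(points, a, b)

print(f"E for y = 0.5x + 0.5: {trial_error:.3f}")
print(f"best fit: a = {a:.3f}, b = {b:.3f}, r = {r:.3f}, E = {best_error:.3f}")
```

The trial line gives \(E = 2.25\), while the least squares line gives a smaller error, as it must. The printed values agree with the calculator results quoted above up to rounding in the last decimal place.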
Our first example looks at energy consumption in the US over the past 50 years.
| Year | Energy Usage, in Quads |
| --- | --- |
| 1950 | 34.6 |
| 1960 | 45.1 |
| 1970 | 67.8 |
| 1980 | 78.3 |
| 1990 | 84.6 |
| 2000 | 98.9 |
The unit 1 Quad is 1 Quadrillion = \(10^{15}\) BTUs, which is enough heat to raise Lake Erie roughly 1°F.
Example 4.3.1: Energy Consumption
Using the energy consumption data given above,
- Plot the data using a graphing calculator.
- Find the least squares regression line and comment on the goodness of fit.
- Interpret the slope of the line of best fit.
- Use the regression line to predict the annual US energy consumption in the year 2013.
- Use the regression line to predict when the annual consumption will reach 120 Quads.
Solution
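- Entering the data into the calculator and plotting it gives a scatter plot of the six data points.
- Performing a linear regression produces the line \(y = 1.287x - 2473.890\) with correlation coefficient \(r \approx 0.987\).
We can tell both from the correlation coefficient and from the graph that the regression line is a good fit to the data.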
- The slope of the regression line is a≈1.287. To interpret this, recall that the slope is the rate of change of the y-coordinates with respect to the x-coordinates. Since the y-coordinates represent the energy usage in Quads, and the x-coordinates represent years, a slope of positive 1.287 indicates an increase in annual energy usage at the rate of 1.287 Quads per year.
- To predict the energy needs in 2013, we substitute x=2013 into the equation of the line of best fit to get y=1.287(2013)−2473.890≈116.841. The predicted annual energy usage of the US in 2013 is approximately 116.841 Quads.
- To predict when the annual US energy usage will reach 120 Quads, we substitute \(y=120\) into the equation of the line of best fit to get \(120 = 1.287x - 2473.890\). Solving for \(x\) yields \(x \approx 2015.454\). Since the regression line is increasing, we interpret this result as saying the annual usage in 2015 won't yet be 120 Quads, but that in 2016, the demand will be more than 120 Quads.
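The steps of this solution can also be reproduced in code. The sketch below again assumes the standard closed-form least squares formulas in place of the calculator's built-in feature. To match the arithmetic in the solution, it rounds the slope and intercept to three decimal places before predicting. The variable names are illustrative.

```python
from math import sqrt

# Year and annual US energy usage in Quads, from the table in the text.
energy = [(1950, 34.6), (1960, 45.1), (1970, 67.8),
          (1980, 78.3), (1990, 84.6), (2000, 98.9)]

n = len(energy)
x_mean = sum(x for x, _ in energy) / n
y_mean = sum(y for _, y in energy) / n
sxx = sum((x - x_mean) ** 2 for x, _ in energy)
syy = sum((y - y_mean) ** 2 for _, y in energy)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in energy)

slope = sxy / sxx                    # Quads per year
intercept = y_mean - slope * x_mean
r = sxy / sqrt(sxx * syy)            # correlation coefficient

# Round to three decimals first, as the text does, then predict.
a, b = round(slope, 3), round(intercept, 3)
usage_2013 = a * 2013 + b            # predicted Quads in 2013
year_120 = (120 - b) / a             # solve 120 = a*x + b for x

print(f"slope = {a}, intercept = {b}, r = {r:.3f}")
print(f"2013: {usage_2013:.3f} Quads; 120 Quads reached at x = {year_120:.3f}")
```

Rounding before predicting is a choice made only to mirror the hand computation. Using the unrounded slope and intercept shifts the 2013 prediction by a few tenths of a Quad, which is a useful reminder of how sensitive an extrapolation is when \(x\) is in the thousands.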
Our next example gives us an opportunity to find a nonlinear model to fit the data. According to the National Weather Service, the predicted hourly temperatures for Painesville on March 3, 2009 were given as summarized below.
| Time | Temperature, °F |
| --- | --- |
| 10AM | 17 |
| 11AM | 19 |
| 12PM | 21 |
| 1PM | 23 |
| 2PM | 24 |
| 3PM | 24 |
| 4PM | 23 |
To enter this data into the calculator, we need to adjust the \(x\) values, since just entering the numbers could cause confusion. (Do you see why?) We have a few options available to us. Perhaps the easiest is to convert the times into 24 hour clock time so that 1 PM is 13, 2 PM is 14, etc. Entering these data into the graphing calculator and plotting the points gives a scatter plot of the seven data points.
While the beginning of the data looks linear, the temperature begins to fall in the afternoon hours. This sort of behavior reminds us of parabolas, and, sure enough, it is possible to find a parabola of best fit in the same way we found a line of best fit. The process is called quadratic regression and its goal is to minimize the total squared error between the data and their corresponding points on the parabola. The calculator has a built-in feature for this as well, which yields the model \(y = -0.321x^2 + 9.464x - 45.857\).
The coefficient of determination \(R^2\) seems reasonably close to 1, and the graph visually seems to be a decent fit. We use this model in our next example.
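Quadratic regression can be sketched in code as well. The version below is one possible approach, not necessarily what the calculator does internally: it builds the three normal equations obtained by minimizing the total squared error and solves them by Gaussian elimination. The helper names `solve_linear_system`, `quadratic_fit` and `r_squared` are hypothetical, and \(R^2\) is computed with the usual definition (one minus residual error over total variation), which the text does not spell out.

```python
# Hour on the 24-hour clock and predicted temperature in degrees F.
temps = [(10, 17), (11, 19), (12, 21), (13, 23), (14, 24), (15, 24), (16, 23)]

def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    rows = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(rows[i][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for i in range(col + 1, n):
            factor = rows[i][col] / rows[col][col]
            for j in range(col, n + 1):
                rows[i][j] -= factor * rows[col][j]
    solution = [0.0] * n
    for i in reversed(range(n)):
        known = sum(rows[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (rows[i][n] - known) / rows[i][i]
    return solution

def quadratic_fit(points):
    """Least squares coefficients (a, b, c) of y = a*x^2 + b*x + c."""
    s = [sum(x ** k for x, _ in points) for k in range(5)]      # sums of x^0 .. x^4
    t = [sum(y * x ** k for x, y in points) for k in range(3)]  # sums of y*x^0 .. y*x^2
    normal = [[s[4], s[3], s[2]],
              [s[3], s[2], s[1]],
              [s[2], s[1], s[0]]]
    return solve_linear_system(normal, [t[2], t[1], t[0]])

def r_squared(points, a, b, c):
    """Coefficient of determination: 1 - (residual error) / (total variation)."""
    y_mean = sum(y for _, y in points) / len(points)
    residual = sum((y - (a * x * x + b * x + c)) ** 2 for x, y in points)
    total = sum((y - y_mean) ** 2 for _, y in points)
    return 1 - residual / total

a, b, c = quadratic_fit(temps)
r2 = r_squared(temps, a, b, c)
print(f"y = {a:.3f}x^2 + {b:.3f}x + ({c:.3f}), R^2 = {r2:.3f}")
```

For larger data sets or higher degree models, a numerical library would be a more robust choice than hand-rolled elimination. The point here is only to show that the parabola of best fit comes from solving a small linear system.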
Example 4.3.2: Quadratic Regression
Using the quadratic model for the temperature data above, predict the warmest temperature of the day. When will this occur?
Solution
The maximum temperature will occur at the vertex of the parabola. Recalling the Vertex Formula (Equation 2.4), \(x = -\frac{b}{2a} \approx -\frac{9.464}{2(-0.321)} \approx 14.741\). This corresponds to roughly 2:45 PM. To find the temperature, we substitute \(x = 14.741\) into \(y = -0.321x^2 + 9.464x - 45.857\) to get \(y \approx 23.899\), or 23.899°F.
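The vertex computation is short enough to check directly. This sketch takes the rounded coefficients quoted in the solution as given and applies the Vertex Formula. The conversion of the fractional hour to minutes is an added convenience, not part of the original solution.

```python
# Rounded coefficients of the quadratic model quoted in the text.
a, b, c = -0.321, 9.464, -45.857

x_vertex = -b / (2 * a)                          # Vertex Formula: x = -b / (2a)
y_vertex = a * x_vertex ** 2 + b * x_vertex + c  # model temperature at the vertex

# Convert the fractional 24-hour time to whole hours and minutes.
hours = int(x_vertex)
minutes = round((x_vertex - hours) * 60)

print(f"x = {x_vertex:.3f}, y = {y_vertex:.3f} degrees F")
print(f"time of day: {hours}:{minutes:02d} on the 24-hour clock")
```

The script reports a time a minute or so before 2:45 PM, consistent with the "roughly 2:45 PM" in the solution.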
The results of the last example should remind you that regression models are just that, models. Our predicted warmest temperature was found to be 23.899°F, but our data says it will warm to 24°F. It's all well and good to observe trends and guess at a model, but a more thorough investigation into why certain data should be linear or quadratic in nature is usually in order - and that, most often, is the business of scientists.