Once data has been collected, you usually have nothing more than a long list of raw values. It is very difficult, if not impossible, to determine any patterns or underlying “themes” from such a list, especially if the data set has more than 20 or so elements. There are three main methods used to summarize qualitative data: in a table (tabular form), in a bar graph, or with a pie chart.

### Tables

An easy way to initially summarize the data is with a **frequency distribution**, which simply lists each main category in the data set, along with the corresponding number of occurrences within each category.

For example, take a bag of plain M&M candies (the ones in the brown package). If you open the bag and simply dump the M&Ms into a bowl, you see lots of colors, but no underlying patterns. If, however, you divide the M&Ms into a separate category for each color (Brown, Red, Blue, Yellow, Green, and Orange), you can count the number of M&Ms of each color. A frequency distribution for a bag of M&Ms might look like:

COLOR | FREQUENCY (NUMBER) |
---|---|
Brown | 9 |
Red | 11 |
Blue | 12 |
Yellow | 8 |
Green | 4 |
Orange | 14 |

If you were to open another bag, your frequencies would likely be somewhat different.
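As a rough illustration, the counting step can be sketched in Python. The list `raw_colors` is a hypothetical stand-in for the bowl of candies, built here to match the counts in the table above; the variable names are assumptions made for this example.

```python
from collections import Counter

# Hypothetical raw data: one entry per candy, built to match the table above.
raw_colors = (
    ["Brown"] * 9 + ["Red"] * 11 + ["Blue"] * 12
    + ["Yellow"] * 8 + ["Green"] * 4 + ["Orange"] * 14
)

# Counter tallies how many times each category occurs in the list.
frequency = Counter(raw_colors)

# Print one row per category, followed by the total number of candies.
for color, count in frequency.items():
    print(f"{color:<8}{count:>3}")
print(f"{'Total':<8}{sum(frequency.values()):>3}")
```

In practice the raw list would come from recording each candy as it is sorted, so the order of the entries would be mixed; the tally does not depend on that order.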

It may still be difficult to identify any underlying patterns if your data set has a large number of categories, or if the frequencies are large numbers. In this case, there is a better way to make a table. If you divide the frequency of each category by the total number of observations, you create a **relative frequency distribution**, which lists the proportion (or percent) of observations within each category relative to the total number of observations. Mathematically, this is easy: relative frequency = frequency ÷ total. For my bag of M&Ms, which had 58 M&Ms in total, the relative frequency distribution would be:

COLOR | RELATIVE FREQUENCY |
---|---|
Brown | 0.155 ( = 9/58 ) |
Red | 0.190 ( = 11/58 ) |
Blue | 0.207 ( = 12/58 ) |
Yellow | 0.138 ( = 8/58 ) |
Green | 0.069 ( = 4/58 ) |
Orange | 0.241 ( = 14/58 ) |

Because every frequency is now related to the total, it is extremely easy to make comparisons among the different categories, since they are now on the same “scale.” One quick check is to add up the relative frequencies. Since we counted every M&M within the bag, and every M&M must belong to exactly one of the six categories, the relative frequencies should add up to 1 ( = 100%). Occasionally, because each value is rounded, the total may come out slightly above or below 1.
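The division and the sum check can be sketched in Python as follows. The dictionary simply restates the counts from the frequency table; the names `relative_frequency` and `rounded` are assumptions for this example.

```python
import math

# Counts taken from the frequency distribution above.
frequency = {"Brown": 9, "Red": 11, "Blue": 12, "Yellow": 8, "Green": 4, "Orange": 14}

total = sum(frequency.values())  # total number of M&Ms in the bag

# Relative frequency = frequency / total, for each category.
relative_frequency = {color: count / total for color, count in frequency.items()}

# Values rounded to three decimal places, as they appear in the table.
rounded = {color: round(p, 3) for color, p in relative_frequency.items()}

for color, p in rounded.items():
    print(f"{color:<8}{p:.3f}")

# Quick check: the unrounded proportions should add up to 1
# (compared with a tolerance, since floating-point sums are not exact).
print("Sums to 1:", math.isclose(sum(relative_frequency.values()), 1.0))
```

The check is done on the unrounded values; a sum of rounded values can miss 1 by a small amount, which is the rounding effect described above.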

### Bar Graphs

Although tables are useful, they still aren’t a nice “picture” of the data. In general, visual methods, such as bar graphs, provide a much better summary of data than just a table alone. This doesn’t mean we wasted our time creating a table, because we’ll need it to draw our bar graph!

A **bar graph** is a graph constructed in the *Cartesian coordinate system* where data categories are listed on the *x*-axis, and a bar (or rectangle) is drawn above each category, where the height of each rectangle corresponds to that category’s frequency or relative frequency. In addition, a horizontal space separates each category (this helps distinguish data that falls into distinct categories from data that has a continuous “flow”). Such graphs are typically easy to create by hand, and even easier in computational tools like Excel. One drawback to creating a bar graph by hand is that you need to be very careful and precise when drawing the heights of the rectangles, so as not to provide an inaccurate picture of the data.

Two bar graphs for the above distributions are shown below. The graph on the left is the bar graph of our frequency distribution, and the graph on the right is the bar graph of our relative frequency distribution. Notice that the values on the vertical scale in the left bar graph are the **counts** of M&Ms within each category, and the values on the vertical scale in the right bar graph are the **percentages** of M&Ms within each category. Both graphs tell the same story and allow for an easier category-to-category comparison than the tables.

Some common good practices should be followed when constructing bar graphs. On the horizontal scale, categories should be spaced equally apart, and the rectangles should all have the same width. The vertical scale should begin at 0, should increase in reasonable steps, and should extend somewhat, but not significantly, beyond the largest frequency or relative frequency.
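For a quick look at the data without any plotting software, a bar graph can be approximated in plain text. The sketch below is only an illustration: the helper `text_bar_graph` is a hypothetical function written for this example, and it draws the bars horizontally (one row per category) rather than vertically. Every bar still starts at 0 and has the same thickness, so bar lengths can be compared directly.

```python
# Counts taken from the frequency distribution above.
frequency = {"Brown": 9, "Red": 11, "Blue": 12, "Yellow": 8, "Green": 4, "Orange": 14}


def text_bar_graph(data, symbol="#"):
    """Return one text line per category: label, a bar of symbols, and the count."""
    width = max(len(label) for label in data)  # pad labels so the bars line up
    lines = []
    for label, value in data.items():
        # One symbol per observation, so the bar length equals the frequency.
        lines.append(f"{label:<{width}} | {symbol * value} {value}")
    return lines


for line in text_bar_graph(frequency):
    print(line)
```

A plotting library or a spreadsheet would normally be used for a presentable graph; the text version is just a fast way to see the shape of the distribution.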

One more simplification with bar graphs that allows for easier comparisons (especially if you have numerous categories) is to arrange the bars in order of height, starting with the highest category on the left, followed by the second highest, and so on, down to the lowest category on the far right. This special type of bar graph is commonly referred to as a **Pareto chart**, named after the Italian economist Vilfredo Pareto. A **Pareto chart** for our relative frequency distribution of M&Ms looks like:
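The ordering behind a Pareto chart can be sketched in Python by sorting the categories from highest to lowest relative frequency. The values are the rounded ones from the relative frequency table; the name `pareto_order` is an assumption for this example.

```python
# Rounded relative frequencies taken from the table above.
relative_frequency = {
    "Brown": 0.155, "Red": 0.190, "Blue": 0.207,
    "Yellow": 0.138, "Green": 0.069, "Orange": 0.241,
}

# Sort the (category, value) pairs by value, largest first.
pareto_order = sorted(relative_frequency.items(), key=lambda item: item[1], reverse=True)

# Left-to-right order of the bars in the Pareto chart.
for color, proportion in pareto_order:
    print(f"{color:<8}{proportion:.3f}")
```

For this bag, the bars would run from Orange on the left to Green on the right.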

### Pie Charts

The final graphical method for categorical data is a pie chart. Although pie charts are rarely used for conveying information in scientific fields (you will seldom open a scientific research paper and see one), newspapers and other media use them because they are a relatively simple way to get a point across quickly. However, pie charts are not very effective if there are too many categories or if some relative frequencies are too small.

To create a pie chart, we need the percent values from our relative frequency distribution. Since a whole pie is 100% of itself, we use a pie piece with an appropriate size to represent the percent of data within each category. Different colors should be used to distinguish each category, and each category should be labeled with the category name and relative frequency. The **pie chart** above represents the M&M data.
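When a pie chart is drawn by hand, the size of each piece is usually set by its central angle: since the whole pie spans 360 degrees, a category's angle is its relative frequency times 360. A minimal Python sketch of that arithmetic, with `slice_angles` as an assumed name for this example:

```python
# Counts taken from the frequency distribution above.
frequency = {"Brown": 9, "Red": 11, "Blue": 12, "Yellow": 8, "Green": 4, "Orange": 14}

total = sum(frequency.values())  # total number of M&Ms in the bag

# Central angle of each pie piece, in degrees: (frequency / total) * 360.
slice_angles = {color: 360 * count / total for color, count in frequency.items()}

for color, angle in slice_angles.items():
    share = frequency[color] / total
    print(f"{color:<8}{share:6.1%}{angle:8.1f} degrees")
```

As with the relative frequencies, the angles give a quick check: together they should account for the full 360 degrees.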