R Tutorial: Data Visualization in R (part 2)

Base R graphics supports many different plot types, and this chapter introduces several that are particularly useful for seeing important features in a dataset and for explaining those features to others. We start with simple tools like histograms and density plots for characterizing one variable at a time, move on to scatter plots and other useful tools for showing how two variables relate, and finally introduce some tools for visualizing more complex relationships in our dataset.

When exploring a new dataset, it is useful to start by looking at one variable at a time to see whether it conforms to our expectations.

For numerical data, a common expectation is that the values should conform approximately to the Gaussian or normal distribution, described by the "bell curve" shown here. Under this assumption, the observed data values cluster around the mean value - here, the value 0, indicated by the solid red line - and become less likely as we move farther from this central value.

The standard measure of "scatter" or "spread" in Gaussian data is the standard deviation, and this plot shows how the likelihood of observing data values decreases as we move out from the mean, first one standard deviation away (the green dashed lines), then two (the yellow lines), and finally three (the blue lines).
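The bell curve and its standard deviation markers can be reproduced with a few lines of base R. This is only a sketch: the line colors follow the description above, while the dashed line type for the yellow and blue lines, the axis labels, and the variable names k and within_k are assumptions. The last lines compute the probability of falling within one, two, and three standard deviations of the mean, which is roughly 68%, 95%, and 99.7%.

```r
# Standard normal density (mean 0, standard deviation 1) over +/- 4 SDs
curve(dnorm(x), from = -4, to = 4, lwd = 2,
      xlab = "Value", ylab = "Density", main = "Standard normal density")

# Mean (solid red) and 1, 2, 3 standard deviations away from it
abline(v = 0, col = "red")
abline(v = c(-1, 1), col = "green", lty = 2)
abline(v = c(-2, 2), col = "yellow", lty = 2)   # line type is an assumption
abline(v = c(-3, 3), col = "blue", lty = 2)     # line type is an assumption

# Probability of a value falling within k standard deviations of the mean
k <- 1:3
within_k <- pnorm(k) - pnorm(-k)
print(round(within_k, 4))
```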

These expectations are not always met, but they have led to the development of several useful graphical tools for characterizing numerical data. Here, the upper left plot shows 200 simulated Gaussian data values, and the other three plots show their characterizations using three of these tools.

The upper right plot shows the histogram, probably the best-known of these tools. It is constructed by, first, dividing the range of the data values into subsets or "bins," and then counting the number of values that appear in each bin. The basic histogram is a barplot where each bar represents a bin and its height represents the number of observations in that bin.
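The binning and counting described here is what the base R hist() function does. The sketch below assumes 200 values drawn with rnorm() and an arbitrary seed, so it will not reproduce the exact data in the plot shown; the object name h is also just for illustration.

```r
set.seed(1)        # arbitrary seed, used only to make the example repeatable
x <- rnorm(200)    # 200 simulated Gaussian values

# hist() draws the plot and invisibly returns the bins and counts it used
h <- hist(x, main = "Histogram of simulated Gaussian data", xlab = "x")

print(h$breaks)    # bin edges that divide the range of the data
print(h$counts)    # number of observations falling in each bin
```

Every observation lands in exactly one bin, so the counts add up to the sample size.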

The lower left shows a density plot, which we can think of as a normalized and smoothed histogram. If our data values approximately satisfy the Gaussian assumption, this plot should look like the "bell curve" shown earlier. Because this example is based on simulated Gaussian data, the density plot should have this shape, and it does, at least roughly. The fact that it is only "approximately bell-shaped" is a consequence of the finite size of our data sample, and it emphasizes that while density plots are good enough to be very useful, they are not perfect.
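One way to see the "normalized and smoothed histogram" idea is to draw a histogram on the density scale and overlay the estimate from density(). The sketch below again assumes simulated data with an arbitrary seed, and uses the default kernel and bandwidth, which may differ from whatever settings produced the plot shown.

```r
set.seed(1)        # arbitrary seed, used only to make the example repeatable
x <- rnorm(200)    # 200 simulated Gaussian values

# freq = FALSE rescales the bars so that their total area is 1
hist(x, freq = FALSE, main = "Histogram and density estimate", xlab = "x")

# Kernel density estimate with default settings, drawn over the histogram
d <- density(x)
lines(d, lwd = 2)

# True standard normal density for comparison (dashed red)
curve(dnorm(x), add = TRUE, lty = 2, col = "red")
```

The gap between the solid and dashed curves gives a feel for how much the finite sample size distorts the estimated shape.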

The lower right shows a normal QQ-plot, which is not as well known as the other tools but is extremely useful. It is constructed by first sorting the data values from smallest to largest, and then plotting them against the corresponding theoretical quantiles of the Gaussian distribution on the x-axis. For Gaussian data, these sorted points should lie very close to the solid red reference line shown in this plot.

This plot was generated with the qqPlot() function from the car package, which adds a 95% confidence envelope - the dashed red lines around the reference line - to help us see how well our data conform to the Gaussian assumption. Since all of the data points fall within these limits, the Gaussian assumption appears to hold very well here, as it should.
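A rough sketch of both approaches follows. The base R functions qqnorm() and qqline() draw the sorted data against Gaussian quantiles with a reference line but no confidence limits; the qqPlot() call is guarded so that it runs only if the car package happens to be installed. The simulated data, the seed, and the object name qq are assumptions for illustration.

```r
set.seed(1)        # arbitrary seed, used only to make the example repeatable
x <- rnorm(200)    # 200 simulated Gaussian values

# Base R: data values against theoretical Gaussian quantiles
qq <- qqnorm(x, main = "Normal QQ-plot (base R)")
qqline(x, col = "red")   # reference line through the first and third quartiles

# car version, which adds the confidence limits around the reference line
if (requireNamespace("car", quietly = TRUE)) {
  car::qqPlot(x)
}
```

The list returned by qqnorm() holds the theoretical quantiles in qq$x and the data values in qq$y, which is handy for checking how the plot was built.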

Here are the results when we apply these tools to a real dataset. The upper left plot shows the duration values from the geyser data frame in the MASS package, describing the durations of 299 eruptions of the Old Faithful geyser in Yellowstone National Park. As before, the other three plots show the histogram (upper right), the density plot (lower left), and the normal QQ-plot (lower right). Here, the density plot is probably the most useful, highlighting the fact that these data values exhibit a bimodal distribution, with not one but two "most likely values."
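A four-panel display along these lines can be sketched as follows. It assumes the MASS package is available (it ships with most R installations), and it uses base R qqnorm() and qqline() in the lower right panel rather than car's qqPlot(), so that panel has no confidence limits. Titles and axis labels are illustrative choices.

```r
library(MASS)      # provides the geyser data frame

op <- par(mfrow = c(2, 2))   # arrange plots in a 2 x 2 grid

plot(geyser$duration, ylab = "Duration", main = "Raw data")       # upper left
hist(geyser$duration, xlab = "Duration", main = "Histogram")      # upper right
plot(density(geyser$duration), main = "Density plot")             # lower left
qqnorm(geyser$duration, main = "Normal QQ-plot")                  # lower right
qqline(geyser$duration, col = "red")

par(op)            # restore the previous plot layout
```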

Now, it's your turn. The following exercises give you a chance to try these tools for yourself.