R Tutorial - Making Basic Graphics in R
Learn how you can create basic graphics in R.
One of the main reasons that both academics and professionals turn to R is its very strong graphical capabilities. Walking away from graphical point-and-click programs such as Excel and SigmaPlot can be quite scary, but you'll soon see that R has much to offer. The big difference is that you create plots with lines of R code, so you can replicate and modify them exactly inside R. This fits nicely into the spirit of reproducibility. The graphics package, which is loaded by default in R, already provides a lot of functionality for creating beautiful, publication-quality plots. Over time, many other R packages have also been developed to create visualizations differently or to build visualizations for very specific fields of study. Examples of very popular packages are ggplot2, ggvis and lattice, but I won't talk about those here.
Among many others, the graphics package contains two functions that I'll talk about: `plot()` and `hist()`. I'll start with `plot()`. This function is generic, meaning that it can plot many different things: depending on the type of data you give it, `plot()` generates a different plot. Typically, you'll want to plot one or two columns from a data frame, but you can also plot linear models, kernel density estimates and much more.
Suppose we have a data frame `countries`, containing information on all of the countries in the world. There's the name of each country, of course, but also its area, the continent it belongs to, and other information such as population and religion. If you have a look at the structure of this data frame, you can see that there are numerical variables and character strings, but also categorical variables, such as continent and religion. How would `plot()` handle these different data types? Let's find out!
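To follow along without the original data, a small stand-in for such a data frame can be built by hand. This is only a sketch: every value in it (country names, numbers, religion labels) is made up for illustration, and only the column names and types follow the description above.

```r
# Hypothetical stand-in for the `countries` data frame; all values are placeholders.
countries <- data.frame(
  name       = c("Country A", "Country B", "Country C",
                 "Country D", "Country E", "Country F"),
  continent  = factor(c("Africa", "Africa", "Asia", "Asia", "Europe", "Africa")),
  area       = c(1200, 300, 9500, 150, 640, 2400),
  population = c(40, 12, 1300, 8, 65, 90),
  religion   = factor(c("Christian", "Muslim", "Buddhist",
                        "Muslim", "Christian", "Muslim")),
  stringsAsFactors = FALSE  # keep `name` as plain character strings
)

# Show the structure: one line per column with its type and first values.
str(countries)
```

The categorical columns are created with `factor()` explicitly, because recent versions of R no longer convert strings to factors automatically in `data.frame()`.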
Let's try to plot the continent column of `countries`. We do the selection using the dollar sign:
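A minimal sketch of that call, using a tiny made-up `countries` data frame (the names and continents below are placeholders, not the data from the video):

```r
# Hypothetical data: six fictional countries with a factor column `continent`.
countries <- data.frame(
  name      = c("Country A", "Country B", "Country C",
                "Country D", "Country E", "Country F"),
  continent = factor(c("Africa", "Africa", "Asia", "Asia", "Europe", "Africa"))
)

# Select the column with `$` and hand it to plot().
# Because the column is a factor, plot() draws a bar chart of the level counts.
plot(countries$continent)
```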
Cool. It looks like R figured out that continent is a factor, and that you'll probably want a bar chart showing the number of countries in each continent. It also automatically assigned labels to the different bars. Now, what if you decide to plot a continuous variable, such as the population:
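A sketch of plotting a single numeric column, with made-up population figures standing in for the real data:

```r
# Hypothetical data: populations are placeholder numbers.
countries <- data.frame(
  name       = c("Country A", "Country B", "Country C",
                 "Country D", "Country E", "Country F"),
  population = c(40, 12, 1300, 8, 65, 90)
)

# A single numeric vector is plotted against its index (1, 2, 3, ...).
plot(countries$population)
```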
This time, the populations of the countries are shown on an index plot: the first country in the data frame is drawn at index 1, the fiftieth at index 50, and so on. Of course, it's also possible to plot two continuous variables, such as area versus population:
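A sketch of the two-variable call, assuming made-up `area` and `population` columns:

```r
# Hypothetical data: areas and populations are placeholder numbers.
countries <- data.frame(
  area       = c(1200, 300, 9500, 150, 640, 2400),
  population = c(40, 12, 1300, 8, 65, 90)
)

# Two numeric vectors give a scatter plot:
# the first argument goes on the x axis, the second on the y axis.
plot(countries$area, countries$population)
```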
The result is a scatter plot, showing a dot for each country. There are some huge countries with many people living there, but also many smaller countries with fewer people. It makes perfect sense that area and population are somewhat related, right? To make this relationship clearer, we can apply the logarithm function to both the area and population vectors. You can use the `log()` function twice:
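A sketch of the log-log version, again on made-up numbers. Note that `log()` computes the natural logarithm unless a `base` is given:

```r
# Hypothetical data: areas and populations are placeholder numbers.
countries <- data.frame(
  area       = c(1200, 300, 9500, 150, 640, 2400),
  population = c(40, 12, 1300, 8, 65, 90)
)

# Apply log() to each vector before plotting, so that very large and
# very small countries are spread out more evenly on both axes.
plot(log(countries$area), log(countries$population))
```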
Finally, `plot()` also accepts two categorical variables, such as continent and religion. For every continent, a stacked bar chart of the different religions is now depicted. On the right, you conveniently see the axis showing the proportions of each religion in each continent. If you switch the two variables inside the `plot()` function, a stacked bar chart of the different continents is depicted for every religion. This means that the first argument of the `plot()` function is the variable on the horizontal axis, the x axis, while the second argument is the variable on the vertical axis, the y axis.
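A sketch of both argument orders, assuming two factor columns with made-up values. When both arguments are factors, base R draws this as a spine plot, in which the width of each bar also reflects how many observations fall in that category:

```r
# Hypothetical data: continent and religion labels for six fictional countries.
countries <- data.frame(
  continent = factor(c("Africa", "Africa", "Asia", "Asia", "Europe", "Africa")),
  religion  = factor(c("Christian", "Muslim", "Buddhist",
                       "Muslim", "Christian", "Muslim"))
)

# Continent on the x axis: one stacked bar of religions per continent.
plot(countries$continent, countries$religion)

# Swapped arguments: one stacked bar of continents per religion.
plot(countries$religion, countries$continent)
```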
All these examples show that the `plot()` function is very capable of visualizing different kinds of information and manages to display the information in an interpretable way.
To better understand your numerical data, it's often a good idea to have a look at its distribution. You can do this with the `hist()` function, which is short for histogram. Basically, a histogram is a visual representation of the distribution of your dataset: it places all the values in bins and plots how many values there are in each bin. Say we want to get a first impression of the population in all of Africa's countries. With a logical vector, `africa_obs`, you can perform a selection on the `countries` data frame to create a sub data frame that contains only African countries.
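The selection and the histogram can be sketched as follows. The data is made up, and the name `africa` for the sub data frame is an assumption; only `africa_obs` and `countries` come from the description above.

```r
# Hypothetical data: continents and placeholder population numbers.
countries <- data.frame(
  continent  = factor(c("Africa", "Africa", "Asia", "Africa",
                        "Europe", "Africa", "Africa", "Asia")),
  population = c(40, 12, 1300, 90, 65, 25, 5, 8)
)

# Logical vector: TRUE for every row whose continent is Africa.
africa_obs <- countries$continent == "Africa"

# Keep only the rows where africa_obs is TRUE (all columns).
africa <- countries[africa_obs, ]

# Draw the histogram; hist() also returns the bins and counts invisibly.
h <- hist(africa$population)
```

Storing the return value in `h` is optional. It is done here only so the bin boundaries (`h$breaks`) and counts (`h$counts`) can be inspected afterwards.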