R Tutorial: Differential expression data


---
Next I'll describe the data that you will explore.

For the videos in this first chapter, I will be using data from a study of 344 patients with breast cancer. Specifically, I'll analyze the differences between patients who are positive or negative for expression of the estrogen receptor, an important clinical indicator.

In the exercises, you will analyze data from a study of chronic lymphocytic leukemia, or CLL for short. The study measured gene expression in 22 patients with CLL, 8 who were stable and 14 whose disease was progressing.

It's OK if you are unfamiliar with cancer biology. The most important thing to note is that in each of these experiments you are testing for differences between two groups of samples.

For every experiment you analyze in this course, you will focus on three main data sets. The first is the expression matrix, which contains the expression measurements. The second is the feature data, which describes each of the measured features, usually genes or proteins. The third is the phenotype data, which describes each of the samples in the study. You will refer to these as `x`, `f`, and `p` so that the code stays short enough to fit on small screens, but I would of course recommend using more informative variable names in your real analysis code.

For the expression matrix, each row is a feature that was measured, and each column is one of the samples. In the breast cancer experiment, 22,283 genes were measured for 344 samples.

The feature data is a data frame with one row per feature, so its number of rows equals the number of rows of the expression matrix. The columns describe the features, in this case genes. Here the columns correspond to the gene symbol, the database identifier, and the chromosomal location in the genome.

The phenotype data is a data frame with one row per sample, so its number of rows equals the number of columns of the expression matrix. The columns describe the samples, in this case the sample identifier, the age of the subject, and whether the tumor sample was positive or negative for the estrogen receptor.
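To make the relationship between the three objects concrete, here is a minimal sketch that builds small simulated stand-ins for `x`, `f`, and `p` and checks that their dimensions line up. The values are random and purely illustrative, not the study data, and every column name other than "symbol" and "er" (here `entrez`, `chrom`, `id`, and `age`) is an assumption made for this example.

```r
# Simulated stand-in data: hypothetical values, NOT the breast cancer study
set.seed(1)
n_genes <- 5
n_samples <- 6

# Expression matrix: one row per feature (gene), one column per sample
x <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
            dimnames = list(paste0("gene", seq_len(n_genes)),
                            paste0("sample", seq_len(n_samples))))

# Feature data: one row per feature (column names other than "symbol" are assumed)
f <- data.frame(symbol = paste0("SYM", seq_len(n_genes)),
                entrez = 1001:1005,
                chrom = c("1q21", "2p13", "7q31", "11p15", "17q12"),
                row.names = rownames(x))

# Phenotype data: one row per sample (column names other than "er" are assumed)
p <- data.frame(id = colnames(x),
                age = c(45, 52, 61, 38, 57, 49),
                er = factor(rep(c("negative", "positive"), each = 3)),
                row.names = colnames(x))

dim(x)  # 5 rows (features) and 6 columns (samples)

# Rows of f match rows of x; rows of p match columns of x
stopifnot(nrow(f) == nrow(x),
          nrow(p) == ncol(x))
```

Running a check like the final `stopifnot` call before an analysis is a cheap way to catch a feature or phenotype table that has drifted out of sync with the expression matrix.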

Both to practice interacting with these three data sets and to become more familiar with the data, you will create a boxplot for a single gene. The function `boxplot` accepts a formula as its first argument. You list the variable to be plotted on the y-axis to the left of the tilde and the variable for the x-axis to the right, and R will create one boxplot for each value of the x-axis variable. You can add a title with the argument `main`.

To create boxplots of a single gene, you put the gene expression to the left of the tilde and the phenotype variable to the right, and you use the feature data for the title of the plot. Specifically, to plot the first gene of the breast cancer data, I subset the first row of the expression matrix for the y-axis, and I select the column "er" from the phenotype data frame for the x-axis. For the title, I use the feature data column "symbol", remembering to subset it to the first row, which corresponds to the first gene.
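The call described above can be sketched as follows. Because the study data are not included here, the example first simulates a tiny `x`, `f`, and `p`; the values and the symbol names are hypothetical, so the resulting plot only illustrates the syntax and says nothing about the real gene.

```r
# Simulated stand-in data: hypothetical values, NOT the breast cancer study
set.seed(1)
x <- matrix(rnorm(5 * 6), nrow = 5)                      # 5 genes by 6 samples
f <- data.frame(symbol = paste0("SYM", 1:5))             # one row per gene
p <- data.frame(er = factor(rep(c("negative", "positive"), each = 3)))  # one row per sample

# y-axis: expression of the first gene (first row of x)
# x-axis: estrogen receptor status (column "er" of p)
# title:  symbol of the first gene (first row of f)
bp <- boxplot(x[1, ] ~ p[, "er"], main = f[1, "symbol"])

# boxplot() invisibly returns the plotted statistics; the group labels are in bp$names
bp$names
```

Plotting a different gene only requires changing the row index in both `x[1, ]` and `f[1, "symbol"]`, which keeps the title in step with the data being shown.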

The expression of this gene appears to be similar in both groups.

Now it's your turn to create a boxplot for a gene in the leukemia study.
