R Tutorial: Differential Gene Expression Overview

---

Differential gene expression analysis is a powerful technique to determine whether genes are expressed at significantly different levels between two or more sample groups. We will use the DESeq2 package to model the gene counts and identify differentially expressed genes.

In this image, we see a heat map with genes as rows, colored by number of counts. These are the genes with large expression differences, or fold changes, between sample groups.

To determine which genes are differentially expressed, one might ask, 'Why not just identify the genes with the largest fold changes in expression between sample groups?'

To get at the answer, let's observe the plot of normalized counts for gene A. The points represent the gene A expression levels for five biological replicates for 'untreated' and 'treated' conditions.

The mean expression for the 'treated' condition is over twice that of the untreated. However, there appears to be greater variation in the 'treated' condition and the difference in expression may not be significant. We need to account for variation in the data when we determine whether genes are differentially expressed.

Therefore, the goal of differential expression analysis is to determine, for each gene, whether the differences in expression between groups are significant given the amount of variation within groups, that is, between the biological replicates.
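As a rough illustration of this point, the sketch below uses made-up normalized counts for a hypothetical gene A (the values are invented, and five replicates per condition is assumed) together with a Welch t-test from base R. DESeq2 does not use a t-test; it is used here only as a simple stand-in to show how within-group variation can swamp a two-fold difference in means.

```r
# Hypothetical normalized counts for gene A (invented values, illustration only)
untreated <- c(95, 110, 100, 105, 90)
treated   <- c(60, 420, 150, 330, 90)

# The treated mean is a little over twice the untreated mean
fold_change <- mean(treated) / mean(untreated)

# Within-group spread is much larger in the treated condition
sd_untreated <- sd(untreated)
sd_treated   <- sd(treated)

# Welch t-test (base R); a simple stand-in, not the model DESeq2 fits
p_value <- t.test(treated, untreated)$p.value

print(c(fold_change = fold_change,
        sd_untreated = sd_untreated,
        sd_treated = sd_treated,
        p_value = p_value))
```

With these invented numbers the fold change exceeds 2, yet the test does not reach the conventional 0.05 threshold, because the treated replicates disagree so strongly with one another.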

To explore the workflow, we will be using a publicly available RNA-Seq dataset from Gerarduzzi et al., published in the journal JCI Insight. In this paper, the goal of the RNA-Seq experiment was to explore why mice over-expressing the Smoc2 gene, that is, producing more Smoc2 mRNA than normal, are more likely to develop kidney fibrosis.

Smoc2, or Secreted modular calcium-binding protein 2, has been shown to have increased expression in kidney fibrosis, which is characterized by an excess of extracellular matrix in the space between tubules and capillaries within the kidney. However, it is unknown how Smoc2 functions in the induction and progression of fibrosis.

Four sample groups are tested: normal control mice, referred to as wild type mice, with and without fibrosis, and Smoc2 over-expressing mice with and without fibrosis.

There are three biological replicates for all normal samples and four replicates for all fibrosis samples. Initially, we will explore the effect of fibrosis on gene expression using 'Wild type' samples during lectures and 'Smoc2 over-expression' data during exercises.

To test whether the expression of genes between two or more groups is significantly different, we need an appropriate statistical model.

The appropriate statistical model is determined by the count distribution. When we plot the distribution of counts for a single sample, we can visualize key features of RNA-Seq count data, including a large proportion of genes with low counts and many genes with zero counts. Also note the long right tail, which arises because there is no upper limit on expression in RNA-Seq data.
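To get a feel for that shape without the real data, here is a small base R sketch that simulates counts for one sample. The number of genes and the negative binomial parameters are arbitrary assumptions chosen only to mimic the pattern described; they are not estimated from the Smoc2 dataset.

```r
set.seed(1)

# Simulated raw counts for one sample across 10,000 hypothetical genes.
# mu and size are arbitrary values picked to give a skewed, zero-heavy shape.
counts <- rnbinom(n = 10000, mu = 50, size = 0.2)

# Share of genes with zero counts and with low counts (fewer than 10 reads)
prop_zero <- mean(counts == 0)
prop_low  <- mean(counts < 10)

# Bin the counts; set plot = TRUE to draw the histogram
h <- hist(counts, breaks = 100, plot = FALSE)

print(c(prop_zero = prop_zero, prop_low = prop_low))
print(summary(counts))
```

The median sits far below the mean and the maximum is many times the mean, which is the long right tail in numeric form.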

If there were no expression variation between biological replicates, a frequently used count distribution known as the Poisson distribution would be an appropriate model. But there is always biological variation, and this additional variation present in RNA-Seq data can be modeled well using the negative binomial model, which we will be using as part of DESeq2.
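The difference between the two models can be sketched with base R random draws. Under a Poisson model the variance equals the mean, while under R's `rnbinom` parameterization the variance is `mu + mu^2 / size`. The mean, the `size` value, and the number of simulated replicates below are arbitrary assumptions for illustration (far more replicates than any real experiment would have).

```r
set.seed(42)

n_rep <- 1000   # simulated replicates of one hypothetical gene
mu    <- 100    # assumed true mean count

# Poisson: variance is expected to equal the mean
pois_counts <- rpois(n_rep, lambda = mu)

# Negative binomial: expected variance is mu + mu^2 / size = 100 + 10000 / 5 = 2100
nb_counts <- rnbinom(n_rep, mu = mu, size = 5)

print(c(pois_mean = mean(pois_counts), pois_var = var(pois_counts)))
print(c(nb_mean = mean(nb_counts), nb_var = var(nb_counts)))
```

Both sets of draws share roughly the same mean, but the negative binomial draws are much more spread out, which is the extra replicate-to-replicate variation the Poisson model cannot capture.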

To start the differential expression analysis, the raw counts of all genes must be in a data frame, with gene IDs as row names and sample names as column names. Each cell represents the number of reads that aligned to the corresponding gene for a given sample.
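A toy version of that layout might look like the following. The gene IDs, sample names, and count values are all invented for illustration; a real count table would have thousands of rows.

```r
# Hypothetical raw counts: 3 genes (rows) x 4 samples (columns)
raw_counts <- data.frame(
  sample_1 = c(12L, 0L, 340L),
  sample_2 = c(15L, 1L, 298L),
  sample_3 = c(40L, 0L, 310L),
  sample_4 = c(37L, 2L, 355L),
  row.names = c("gene_1", "gene_2", "gene_3")
)

# Quick checks of the structure
print(raw_counts)
print(dim(raw_counts))        # genes x samples
print(rownames(raw_counts))   # gene IDs
print(colnames(raw_counts))   # sample names
```

Raw counts are whole numbers of aligned reads, so the columns are stored as integers here.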

In addition to our raw counts, we require sample metadata. At the very least, we need to know which condition each of our samples belongs to.

To generate our metadata, we create a vector for each column and combine the vectors into a data frame. The sample names are added as the row names.
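For the wild type samples described earlier (three normal and four fibrosis replicates), these steps could be sketched as below. The column names, the group labels, and the sample names are assumptions made for illustration; in practice the row names must match the column names of the actual count table.

```r
# One vector per metadata column (labels are hypothetical)
genotype  <- rep("wt", 7)
condition <- c(rep("normal", 3), rep("fibrosis", 4))

# Combine the vectors into a data frame
metadata <- data.frame(genotype, condition)

# Add the sample names as row names (names invented for this sketch)
rownames(metadata) <- c("wt_normal1", "wt_normal2", "wt_normal3",
                        "wt_fibrosis1", "wt_fibrosis2",
                        "wt_fibrosis3", "wt_fibrosis4")

print(metadata)
print(table(metadata$condition))
```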

Let's practice exploring counts and getting our files ready for analysis.
