RNA Sequencing - Building a FASTQ to BAM pipeline


Learn how to create a computational RNA sequencing pipeline using free and open source bioinformatics software. We will use R to download the _Saccharomyces cerevisiae_ genome, transcriptome, and known SNPs from Ensembl, and use a publicly available paired-end RNA-Seq data set from the Sequence Read Archive (SRA) to develop and test our pipeline in R. Learn how to trim reads and align them, as well as further steps like duplicate removal and how to use the GATK for base re-calibration.

Next time, we'll make the pipeline more generic, and use it to run several data sets and extract RPKM values from our aligned data for differential expression analysis.

Chapters:
00:00:00 - Sound check and introduction
00:01:07 - Overview for today
00:02:53 - Install software I forgot
00:07:43 - Building a primary_assembly reference genome
00:22:25 - Download the transcriptome and known SNPs
00:27:42 - Creating the required genomic index files
00:34:38 - Building the RNA sequencing pipeline
00:37:00 - Execute external commands using R
00:42:42 - Static variables and the folder structure
00:47:00 - Automate downloading reads from SRA
00:51:27 - Trimming reads using Trimmomatic
01:05:43 - RNA paired-end alignment using STAR
01:12:00 - Samtools: BAM index and alignment statistics
01:28:12 - Picard tools: Duplicate removal and readgroup information
01:38:23 - GATK: Base re-calibration using known SNPs
01:45:24 - IGV: Visualize genome, transcriptome, and aligned reads
01:53:26 - What we'll do next time and Outro
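
As a rough outline of the steps listed in the chapters, the sketch below prints one command per step for a single paired-end run, so the order can be reviewed before anything is executed. It is not the script from the video: all paths, the jar names, the read-group values, the Trimmomatic settings (taken from the Trimmomatic quick-start example) and the use of the GATK4 `gatk` launcher are assumptions, and `SRRxxxxxxx` is a placeholder accession.

```shell
#!/bin/sh
# Dry-run sketch: print one command per pipeline step for a single paired-end run.
# Every path, jar location and read-group value here is a placeholder (assumption).

emit() { printf '%s\n' "$*"; }

print_pipeline() {
  sample="$1"                        # SRA run accession
  ref="genome/genome.fa"             # hypothetical reference FASTA
  snps="genome/known_snps.vcf.gz"    # hypothetical bgzip-compressed, tabix-indexed VCF
  out="output/$sample"
  bam="$out/$sample.Aligned.sortedByCoord.out.bam"   # name STAR gives its sorted BAM

  # Download reads, one FASTQ file per mate
  emit fasterq-dump --split-files "$sample" -O "$out"
  # Trim adapters and low-quality bases (P = paired output, U = unpaired output)
  emit java -jar trimmomatic.jar PE -threads 2 \
    "$out/${sample}_1.fastq" "$out/${sample}_2.fastq" \
    "$out/${sample}_1P.fastq.gz" "$out/${sample}_1U.fastq.gz" \
    "$out/${sample}_2P.fastq.gz" "$out/${sample}_2U.fastq.gz" \
    ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
  # Align the paired reads and write a coordinate-sorted BAM
  emit STAR --runThreadN 2 --genomeDir genome/STAR \
    --readFilesIn "$out/${sample}_1P.fastq.gz" "$out/${sample}_2P.fastq.gz" \
    --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate \
    --outFileNamePrefix "$out/$sample."
  # Index the BAM and report alignment statistics
  emit samtools index "$bam"
  emit samtools flagstat "$bam"
  # Remove duplicates, then add read-group information
  emit java -jar picard.jar MarkDuplicates "I=$bam" "O=$out/$sample.dedup.bam" \
    "M=$out/$sample.metrics.txt" REMOVE_DUPLICATES=true
  emit java -jar picard.jar AddOrReplaceReadGroups "I=$out/$sample.dedup.bam" \
    "O=$out/$sample.rg.bam" "RGID=$sample" RGLB=lib1 RGPL=ILLUMINA RGPU=unit1 "RGSM=$sample"
  emit samtools index "$out/$sample.rg.bam"
  # Base quality re-calibration using the known SNPs
  emit gatk BaseRecalibrator -R "$ref" -I "$out/$sample.rg.bam" \
    --known-sites "$snps" -O "$out/$sample.recal.table"
  emit gatk ApplyBQSR -R "$ref" -I "$out/$sample.rg.bam" \
    --bqsr-recal-file "$out/$sample.recal.table" -O "$out/$sample.recal.bam"
}

print_pipeline SRRxxxxxxx
```

Printing the commands instead of running them keeps the sketch safe to try on a machine where the tools are not installed yet; the printed lines can be copied into a terminal one at a time once the paths have been adjusted.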

#rnaseq #howto #bioinformatics #computationalbiology #academicyoutube #sequencing #ngs #nextgenerationsequencing #educationalvideos #biostreaming #academia #software #biologylecture #virtualmachine #debian #virtualbox #sratoolkit


Comments

UPDATE APRIL 2024

Thanks for the engagement, comments and feedback. Due to updates to STAR and Picard tools, two additional steps (git checkout) are required to get the versions used in the video. I have updated the "0_installSoftware" script to make sure the correct versions are used. Please let me know if you get stuck on any additional steps.

DannyArends

Dear Danny! I am inspired by your teaching style. I loved the series and found it very very useful and easy to follow. A lot of times coding abilities come with an attitude which gets in the way when people teach. You are the best! Thank you for this series and waiting to learn more!

jaypatankar

Thank you so much for the batch 2 of the tutorial - it's super helpful

augustinechukwunta

Thank you very much for these lectures/tutorials. They are precious to me. I am a biologist who struggles to comprehend a lot of what was presented here, especially the R code. I replicated your steps on WSL and got stuck at IGV, with an error saying "no X11 etc.". I spent ages on it, installed PuTTY and tried forwarding, and in the end I gave up; I will try dual booting or a virtual Linux machine like yours. (Another error came up when using bgzip: the command was not recognized although I had installed it. I tried it on another Ubuntu machine and it works, very weird. By the way, sratoolkit on Ubuntu, downloaded the way you told us, has a different path: it does not have etc or usr, just the folder directly.) Oh, people who are trying to imitate a bioinformatician need to know an enormous amount of information (plus, and pardon me for this, the true bioinformaticians themselves seem to complicate matters by making so many different versions and steps). I will try to set up an account on Twitch to follow you there as well. Looking forward to the next lecture, IN SHAA ALLAH. Thanks a lot and best wishes. Mohamed.

testforall

Thank you very much .. These videos are very helpful.
btw which video games r u playing? ;)

histephenson

Good day, thank you for the video, very informative. I just want to know if I can use the virtual machine for the human genome? I will be checking gene expression levels between two groups, and I figured that is a large data set. I would appreciate your guidance on this. Thanks

nomawethumasina

I recently saw that you would be making an explanatory video to address some issues that have arisen with PICARD. Some time ago, I installed all the software and everything worked great with the analysis. Although I haven't used them again... I have a question: Will all the software I have installed still work, or might it encounter errors due to updates? I would like to know if you will be publishing the video from April 27 on any platform. Thank you.

CHD.

I am reading a document called "RNA-Seq workflow: gene-level exploratory analysis and differential expression" and I am very confused about the workflow. They say that transcript abundance quantification methods like Kallisto and Salmon, which estimate abundances without aligning reads, followed by the tximport package, are better because they skip the generation of large files. At what point in this video, and with which tool, do we do this? I guess I still do not understand the whole workflow and all the tools that can be used. Thanks a lot

Iceletters

Thank you very much for these lectures/tutorials. Upon executing the bgzip command, the following error message was shown:
> cmd <- paste0("bgzip -k
#cat(cmd, "\n")
system(cmd)
bgzip: invalid option -- 'k'
Version: 1.13+ds
Usage: bgzip [OPTIONS] [FILE] ...

vinayaraj
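
The error above suggests that this bgzip build does not recognise the `-k` (keep) option. A possible workaround, sketched here under that assumption, is to write the compressed stream to stdout with `-c`, which leaves the input file in place. The function name `bgzip_keep` and the file name in the usage comment are hypothetical.

```shell
# Compress FILE to FILE.gz with bgzip while keeping the original,
# without relying on the -k flag.
bgzip_keep() {
  bk_in="$1"
  if ! command -v bgzip >/dev/null 2>&1; then
    echo "bgzip not found on PATH" >&2
    return 1
  fi
  # -c writes the compressed data to stdout, so the input is left untouched
  bgzip -c "$bk_in" > "$bk_in.gz"
}

# Usage (hypothetical file name):
#   bgzip_keep known_snps.vcf && tabix -p vcf known_snps.vcf.gz
```

tabix builds its index on the bgzip-compressed file itself, so the `.vcf.gz` stays compressed after indexing.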

At minute 29:40, is the file that tabix reads still compressed?

juliangrandvallet

I am using a UTM virtual Ubuntu on my M2 MacBook Air. Can I use the software directly in the MacBook terminal? Because the Ubuntu is taking a lot of space on my system 😢

JaskaranSingh-ommv

Thank you so much for this informative series! I have a question: What if we cannot find the VCF file on Ensembl? Do we skip those parts? Is there somewhere else to look? My organism is Papio anubis. Thank you!

laurynwinter

Hi! Thanks for the video, very instructive!! I have a quick question: Does Picard automatically detect and delete unwanted duplicates from the BAM file? Or is there a risk of losing biological information when performing this step?

sMr_Borgov

Hi Danny, help me.
I can't find the index file gtf.gz in the Ensembl FTP files. Where can I download it from? My organism is Capsicum annuum.

conlosjuguetesdemihijo

Thanks for your video. Just a quick question: among trimming methods, besides Trimmomatic there is also a package called ngsshort. Do you think I can use ngsshort rather than Trimmomatic for RNA-seq data?

freezingtolerance

Thank you for your tutorial. Is there any way to build the genome index in STAR with less than 8 GB of RAM (all I assigned to VirtualBox)?

danieljairenriquezvera

I want to do RNA-seq data analysis using the human genome. When running the following lines, I get a failure:

# Generate genome/transcriptome index using STAR
STAR --runThreadN 2 --runMode genomeGenerate \
--genomeDir ~/genome/STAR \
--genomeSAindexNbases 10 \
--sjdbGTFfile \
--genomeFastaFiles

STAR version: 2.7.11b compiled: 2024-01-25T16:12:02-05:00
Apr 06 00:23:00 started STAR run
Apr 06 00:23:00 ... starting to generate Genome files
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Afgebroken

What is going wrong?

dariocosemans
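
A `std::bad_alloc` during `genomeGenerate` usually means a memory allocation failed. The STAR manual gives a rule of thumb of at least 10 bytes of RAM per genome base, which for a human genome of roughly 3 gigabases comes to about 32 GB. The sketch below compares that estimate with the memory the (virtual) machine reports on Linux. The function names and the approximate genome size are assumptions for illustration.

```shell
# Rough RAM needed by STAR genomeGenerate: about 10 bytes per genome base
# (rule of thumb from the STAR manual).
star_ram_estimate_bytes() { echo $(( $1 * 10 )); }

# Total physical memory in bytes, read from /proc/meminfo (Linux only, kB units).
total_ram_bytes() {
  [ -r /proc/meminfo ] || return 1
  awk '/^MemTotal:/ { printf "%.0f\n", $2 * 1024 }' /proc/meminfo
}

genome_bases=3100000000   # assumed approximate size of the human genome
need=$(star_ram_estimate_bytes "$genome_bases")

if have=$(total_ram_bytes) && [ -n "$have" ]; then
  if [ "$have" -lt "$need" ]; then
    echo "Likely too little RAM: have $have bytes, estimated need $need bytes"
  else
    echo "RAM looks sufficient: have $have bytes, estimated need $need bytes"
  fi
else
  echo "Could not read /proc/meminfo; estimated need is $need bytes"
fi
```

If the machine falls short, the options are to give the virtual machine more memory, to build the index on a larger machine, or to try STAR's `--genomeSAsparseD` parameter, which the manual describes as lowering RAM use at the cost of mapping speed.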

Thanks for the great video. When I try to run the script as one it stops after Step 4.1 and does not start Step 5. Just stays with a plus sign like if something is missing from the code.

solomonantonio

Hi Danny, thank you again for the second series of the tutorial!

I followed your tutorial step-by-step but I encountered an error in generating the index using STAR:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted

My input was similar to yours except that I changed the --genomeSAindexNbases to 14 as I'm working with the human sample:
STAR --runThreadN 2 --runMode genomeGenerate --genomeDir ~/genome/STAR --genomeSAindexNbases 14 --genomeFastaFiles --sjdbGTFfile

Correct me if I'm wrong, but based on my googling, it has to do with RAM? The settings of the VM were exactly the same as in your 1st tutorial.

Your reply is highly appreciated, Danny! Thanks

farrkf
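
On the choice of `--genomeSAindexNbases`: the STAR manual suggests scaling it for small genomes as min(14, log2(GenomeLength)/2 - 1). The sketch below computes that value from a FASTA file; the function names are hypothetical, and the result is rounded down. It gives 10 for a genome of about 12 million bases such as yeast, and 14 for a human-sized genome, so the value itself is unlikely to be what triggers `std::bad_alloc`.

```shell
# Sum the sequence lengths in a FASTA file (header lines starting with '>' are skipped).
fasta_length() {
  awk '!/^>/ { n += length($0) } END { printf "%.0f\n", n + 0 }' "$1"
}

# genomeSAindexNbases as suggested in the STAR manual:
# min(14, log2(GenomeLength)/2 - 1), rounded down here.
star_sa_index_nbases() {
  awk -v n="$1" 'BEGIN {
    v = int(log(n) / log(2) / 2 - 1)
    if (v > 14) v = 14
    print v
  }'
}

# Usage (hypothetical path):
#   star_sa_index_nbases "$(fasta_length genome/genome.fa)"
```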

Hi Danny, I have no words to thank you for these incredible lessons. I have a question: how would the code look if I am using data from a company? The file is a tar.gz. Sorry for the silly question, I am a beginner. Thank you again.

zahraamasrawi
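
For reads delivered as a tar.gz archive rather than downloaded from SRA, one possible approach is to unpack the archive and start the pipeline at the trimming step. The sketch below only unpacks the archive and lists the FASTQ files it finds; the function name, the directory layout and the file extensions searched for are assumptions, since sequencing providers name their files differently.

```shell
# Unpack a tar.gz delivery into DEST and list the FASTQ files it contains.
unpack_reads() {
  ur_archive="$1"
  ur_dest="$2"
  mkdir -p "$ur_dest" || return 1
  tar -xzf "$ur_archive" -C "$ur_dest" || return 1
  find "$ur_dest" -type f \( -name '*.fastq' -o -name '*.fastq.gz' \
    -o -name '*.fq' -o -name '*.fq.gz' \) | sort
}

# Usage (hypothetical names):
#   unpack_reads delivery.tar.gz raw_reads
```

The listed files would then take the place of the FASTQ files that the SRA download step produces.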
