RNA Sequencing - Building a FASTQ to BAM pipeline

Learn how to create a computational RNA sequencing pipeline using free and open source bioinformatics software. We will use R to download the _Saccharomyces cerevisiae_ genome, transcriptome, and known SNPs from Ensembl, and use a publicly available paired-end RNA-Seq data set from the Sequence Read Archive (SRA) to develop and test our pipeline in R. Learn how to trim reads and align them, and how to carry out further steps such as duplicate removal and base recalibration with the GATK.

Next time, we'll make the pipeline more generic, and use it to run several data sets and extract RPKM values from our aligned data for differential expression analysis.

Chapters:
00:00:00 - Sound check and introduction
00:01:07 - Overview for today
00:02:53 - Install software I forgot
00:07:43 - Building a primary_assembly reference genome
00:22:25 - Download the transcriptome and known SNPs
00:27:42 - Creating the required genomic index files
00:34:38 - Building the RNA sequencing pipeline
00:37:00 - Execute external commands using R
00:42:42 - Static variables and the folder structure
00:47:00 - Automate downloading reads from SRA
00:51:27 - Trimming reads using Trimmomatic
01:05:43 - RNA paired-end alignment using STAR
01:12:00 - Samtools: BAM index and alignment statistics
01:28:12 - Picard tools: Duplicate removal and readgroup information
01:38:23 - GATK: Base re-calibration using known SNPs
01:45:24 - IGV: Visualize genome, transcriptome, and aligned reads
01:53:26 - What we'll do next time and Outro
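
The chapters above build each external command as a string in R and then execute it. A minimal sketch of the same "print, then run" idea is given here in shell rather than R. The function name `run_step` and the `DRY_RUN` switch are hypothetical and are not taken from the video.

```shell
#!/usr/bin/env bash
# Sketch of a "print, then run" wrapper, similar in spirit to building a
# command string in R and passing it to system().
# run_step and DRY_RUN are hypothetical names introduced for this example.

run_step() {
  # Print the command so the log shows exactly what was executed.
  printf '+ %s\n' "$*"
  # With DRY_RUN=1, only print the command; do not execute it.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    return 0
  fi
  # Run the command; report and propagate a non-zero exit status.
  "$@" || {
    status=$?
    printf 'step failed (exit %d): %s\n' "$status" "$*" >&2
    return "$status"
  }
}
```

Calling `DRY_RUN=1 run_step samtools index sample.bam` would only print the command, which can help when checking a long pipeline before running it for real (`sample.bam` is a placeholder).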

#rnaseq #howto #bioinformatics #computationalbiology #academicyoutube #sequencing #ngs #nextgenerationsequencing #educationalvideos #biostreaming #academia #software #biologylecture #virtualmachine #debian #virtualbox #sratoolkit
Comments

UPDATE APRIL 2024

Thanks for the engagement, comments, and feedback. Due to updates to STAR and Picard tools, two additional steps (git checkout) are required to get the versions used in the video. I have updated the "0_installSoftware" script to make sure the correct versions are used. Please let me know if you get stuck on any additional steps.

DannyArends

Dear Danny! I am inspired by your teaching style. I loved the series and found it very very useful and easy to follow. A lot of times coding abilities come with an attitude which gets in the way when people teach. You are the best! Thank you for this series and waiting to learn more!

jaypatankar

Thank you so much for the batch 2 of the tutorial - it's super helpful

augustinechukwunta

Thank you very much for these lectures/tutorials. They are precious to me. I am a biologist who struggles to comprehend a lot of what was presented here, especially the R code. I replicated your steps on WSL and got stuck at IGV, with an error saying "no X11 etc.". I spent ages on it, installed PuTTY and tried X forwarding, and in the end I gave up; I will try dual booting or a virtual Linux machine like you. (Also, another error came up when using bgzip, saying it is not known, although I installed it. I tried it on another Ubuntu and it works, very weird. By the way, sratoolkit on Ubuntu, the one you told us to download, has a different path: it does not have etc or usr, it goes directly to the folder.) Oh, people who are trying to imitate bioinformaticians need to know an enormous amount of information (plus, and pardon me for this, the true bioinformaticians themselves seem to complicate matters by making so many different versions and steps). I will try to set up an account on Twitch to follow you there as well. Looking forward to the next lecture, in sha Allah. Thanks a lot and best wishes. Mohamed.

testforall
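
A "command not found" error for a tool that was installed often means its folder is not on PATH in the current shell. A minimal sketch for checking this is shown here; the function name `missing_tools` is hypothetical, and the tool names in the usage note are just examples from this series.

```shell
#!/usr/bin/env bash
# Print the name of every given command that cannot be found on PATH.
# missing_tools is a hypothetical helper name used for illustration.

missing_tools() {
  for tool in "$@"; do
    # command -v succeeds only if the shell can resolve the name.
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

For example, `missing_tools bgzip tabix samtools STAR` prints only the names that the current shell cannot find. If a tool is listed even though it is installed, adding its folder to PATH is usually the next thing to try.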

Thank you very much .. These videos are very helpful.
btw which video games r u playing? ;)

histephenson

Good day, thank you for the video, very informative. I just want to know if I can use the virtual machine for the human genome? I will be checking gene expression levels between two groups, and I figure that is a large data set. I would appreciate your guidance on this. Thanks.

nomawethumasina

I recently saw that you would be making an explanatory video to address some issues that have arisen with PICARD. Some time ago, I installed all the software and everything worked great with the analysis. Although I haven't used them again... I have a question: Will all the software I have installed still work, or might it encounter errors due to updates? I would like to know if you will be publishing the video from April 27 on any platform. Thank you.

CHD.

I am reading a document called "RNA-Seq workflow: gene-level exploratory analysis and differential expression" and I am very confused about the workflow. They say that transcript abundance quantification methods like Kallisto and Salmon, which estimate abundances without aligning reads, followed by the tximport package, are better because they skip the generation of large files. At what point in this video, and with what tool, do we do this? I guess I still do not understand the whole workflow and all the tools that can be used. Thanks a lot.

Iceletters

Thank you very much for these lectures/tutorials. Upon executing the bgzip command, the following error message was shown:
> cmd <- paste0("bgzip -k
#cat(cmd, "\n")
system(cmd)
bgzip: invalid option -- 'k'
Version: 1.13+ds
Usage: bgzip [OPTIONS] [FILE] ...

vinayaraj
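
The bgzip build in the error above (1.13) rejects `-k`. One way to keep the original file without that flag is to write the compressed data to standard output with `-c` and redirect it to a new file. This is a sketch, not the script from the video; `bgzip_keep` and the `BGZIP` override variable are hypothetical names.

```shell
#!/usr/bin/env bash
# Compress a file while keeping the original, without relying on -k.
# bgzip_keep is a hypothetical helper; BGZIP is a hypothetical override
# for the name of the bgzip binary.

bgzip_keep() {
  in="$1"
  # -c writes the compressed stream to stdout, so the input is untouched.
  "${BGZIP:-bgzip}" -c "$in" > "$in.gz"
}
```

After `bgzip_keep known_snps.vcf` (a placeholder file name), both `known_snps.vcf` and `known_snps.vcf.gz` should exist.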

At minute 29:40, is the file that tabix reads still compressed?

juliangrandvallet
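
On the question above: tabix works on bgzip-compressed files and writes its index next to them, so the input is expected to stay compressed. A quick check that a file is gzip-style compressed is to look at its first two bytes. The helper name `is_gzip` is hypothetical. Because bgzip output is also valid gzip, this check cannot tell plain gzip apart from bgzip.

```shell
#!/usr/bin/env bash
# Return success if the file starts with the gzip magic bytes 1f 8b.
# is_gzip is a hypothetical helper name; bgzip (BGZF) files also pass.

is_gzip() {
  # Read the first two bytes and render them as lowercase hex.
  magic=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
  [ "$magic" = "1f8b" ]
}
```

A typical use would be `is_gzip snps.vcf.gz && tabix -p vcf snps.vcf.gz`, where `snps.vcf.gz` is a placeholder. tabix would then write `snps.vcf.gz.tbi` alongside the compressed file.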

I am using a UTM virtual Ubuntu on my M2 MacBook Air. Can I use the software directly in the MacBook terminal? The Ubuntu VM is taking up a lot of space on my system 😢

JaskaranSingh-ommv

Thank you so much for this informative series! I have a question: What if we cannot find the VCF file on Ensembl? Do we skip those parts? Is there somewhere else to look? My organism is Papio anubis. Thank you!

laurynwinter

Hi! Thanks for the video, very instructive! I have a quick question: does Picard automatically detect and delete unwanted duplicates from the BAM file? Or is there a risk of losing biological information when performing this step?

sMr_Borgov
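
On the question above: by default Picard MarkDuplicates only sets the duplicate flag (0x400, decimal 1024) on reads and leaves them in the file. It drops them only when asked to, for example with REMOVE_DUPLICATES=true. To see how many reads were flagged, a sketch like the following counts them from SAM text on standard input. The function name `count_duplicates` is hypothetical.

```shell
#!/usr/bin/env bash
# Count reads with the duplicate bit (1024) set in SAM text read from stdin.
# Prints "<duplicates> <total reads>". count_duplicates is a hypothetical name.

count_duplicates() {
  awk -F '\t' '
    /^@/ { next }                        # skip SAM header lines
    {
      total++
      # Field 2 is the FLAG; test bit 1024 without gawk-only functions.
      if (int($2 / 1024) % 2 == 1) dup++
    }
    END { printf "%d %d\n", dup, total }
  '
}
```

With samtools installed, `samtools view -h marked.bam | count_duplicates` would report the counts for a placeholder file `marked.bam`. Comparing the two numbers gives a feel for how much data duplicate removal would discard.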

Hi Danny, please help me.
I can't find the index file gtf.gz in the Ensembl FTP files. Where can I download it from? My organism is Capsicum annuum.

conlosjuguetesdemihijo

Thanks for your video. Just a quick question: among trimming methods, besides Trimmomatic there is also a package called ngsshort. Do you think I can use ngsshort instead of Trimmomatic for RNA-seq data?

freezingtolerance

Thank you for your tutorial. Is there any way to build the genome index in STAR with less than 8 GB of RAM (that is all I assigned to VirtualBox)?

danieljairenriquezvera
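
On the question above: STAR documents the `--genomeSAsparseD` option as a way to lower RAM use during indexing and mapping, at the cost of mapping speed. The sketch below only prints a candidate command and does not run it. The helper name `star_lowmem_cmd`, the default sparsity of 2, and the paths in the usage note are assumptions; whether the result fits in 8 GB depends on the genome.

```shell
#!/usr/bin/env bash
# Print (not run) a STAR genomeGenerate command with a sparser suffix array.
# star_lowmem_cmd is a hypothetical helper; paths and the default sparsity
# value (2) are placeholders chosen for illustration.

star_lowmem_cmd() {
  genome_dir="$1"; fasta="$2"; gtf="$3"; sparse="${4:-2}"
  printf 'STAR --runThreadN 2 --runMode genomeGenerate'
  printf ' --genomeDir %s --genomeFastaFiles %s --sjdbGTFfile %s' \
    "$genome_dir" "$fasta" "$gtf"
  # Larger values reduce RAM use but slow down mapping.
  printf ' --genomeSAsparseD %s\n' "$sparse"
}
```

For example, `star_lowmem_cmd ~/genome/STAR genome.fa genes.gtf 3` prints the command so it can be inspected before being run by hand.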

I want to do RNAseq data analysis using the human genome. When reading the following lines in, I get a failure error:

# Generate genome/transcriptome index using STAR
STAR --runThreadN 2 --runMode genomeGenerate \
--genomeDir ~/genome/STAR \
--genomeSAindexNbases 10 \
--sjdbGTFfile \
--genomeFastaFiles

STAR version: 2.7.11b compiled: 2024-01-25T16:12:02-05:00
Apr 06 00:23:00 started STAR run
Apr 06 00:23:00 ... starting to generate Genome files
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Afgebroken (Dutch for "Aborted")

What is going wrong?

dariocosemans
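
A note on the `--genomeSAindexNbases 10` value in the command above: the STAR manual suggests scaling this parameter for small genomes to min(14, log2(GenomeLength)/2 - 1). The sketch below evaluates that formula. The helper name `sa_index_nbases` is hypothetical, and the genome lengths in the usage note are rough figures.

```shell
#!/usr/bin/env bash
# Evaluate min(14, log2(GenomeLength)/2 - 1), rounded down, for a genome
# length given in bases. sa_index_nbases is a hypothetical helper name.

sa_index_nbases() {
  awk -v n="$1" 'BEGIN {
    v = int(log(n) / log(2) / 2 - 1)   # log2(n)/2 - 1, truncated
    if (v > 14) v = 14                 # cap at 14
    print v
  }'
}
```

`sa_index_nbases 12100000` (roughly yeast-sized) gives 10, while `sa_index_nbases 3100000000` (roughly human-sized) gives 14. So a value of 10 fits the yeast genome from the video, not a human genome. Separately, a `std::bad_alloc` during genomeGenerate generally means a memory allocation failed, so the RAM given to the VM is also worth checking.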

Thanks for the great video. When I try to run the script as a whole, it stops after Step 4.1 and does not start Step 5. It just stays at a plus sign, as if something is missing from the code.

solomonantonio

Hi Danny, thank you again for the second series of the tutorial!

I followed your tutorial step-by-step but I encountered an error in generating the index using STAR:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted

My input was similar to yours except that I changed the --genomeSAindexNbases to 14 as I'm working with the human sample:
STAR --runThreadN 2 --runMode genomeGenerate --genomeDir ~/genome/STAR --genomeSAindexNbases 14 --genomeFastaFiles --sjdbGTFfile

Correct me if I'm wrong, but based on my googling, it has to do with RAM? The VM settings were exactly the same as in your first tutorial.

Your reply is highly appreciated, Danny! Thanks

farrkf
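
On the RAM question above: STAR's documentation gives a rough rule of thumb of about 10 bytes of RAM per genome base, which works out to around 32 GB for a human genome. The sketch below applies only that rule of thumb and should be read as an estimate. The helper name `star_ram_estimate_gb` is hypothetical.

```shell
#!/usr/bin/env bash
# Rough RAM estimate for STAR genome generation: ~10 bytes per genome base,
# reported in GB (10^9 bytes). star_ram_estimate_gb is a hypothetical name.

star_ram_estimate_gb() {
  awk -v n="$1" 'BEGIN { printf "%.1f\n", n * 10 / 1e9 }'
}
```

`star_ram_estimate_gb 3100000000` prints 31.0 for a roughly human-sized genome, far more than a small VM usually has. On Linux, `grep MemTotal /proc/meminfo` shows how much memory the VM actually sees.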

Hi Danny, I have no words to thank you for these incredible lessons. I have a question: what would the code look like if I am using data from a company? The file is a tar.gz. Sorry for the silly question, I am a beginner. Thank you again.

zahraamasrawi
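
On the tar.gz question above: reads delivered as a tar.gz archive first need to be unpacked, after which the FASTQ files can go into the trimming step in place of the SRA download. This is a sketch under that assumption. The helper name `extract_reads` and the file name patterns are hypothetical, and the layout inside a provider's archive may differ.

```shell
#!/usr/bin/env bash
# Unpack a tar.gz archive into a folder and list the FASTQ files it contains.
# extract_reads is a hypothetical helper; the name patterns are assumptions.

extract_reads() {
  archive="$1"; dest="$2"
  mkdir -p "$dest"
  # -x extract, -z gunzip, -f archive file, -C target directory
  tar -xzf "$archive" -C "$dest"
  # List FASTQ files found after extraction (compressed or not).
  find "$dest" -type f \( -name '*.fastq' -o -name '*.fastq.gz' \
    -o -name '*.fq' -o -name '*.fq.gz' \) | sort
}
```

For example, `extract_reads delivery.tar.gz ~/reads` (placeholder names) prints the paths of the extracted read files. For paired-end data these usually come in R1/R2 pairs.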