Pipeline for RNA-Seq data analysis¶
The pipeline is for RNA-Seq data analysis on condo, located in:
for gsnap and tophat.
The configuration file for the pipeline is proj.cfg in:
Steps to run the pipeline¶
Transfer your data to your home directory on /work/LAS/your-lab¶
- Run an md5sum check if needed.
- Run fastqc if needed.
- Clean the reads if needed.
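The md5sum check in the first step can be sketched as follows. This is a minimal illustration, not part of the pipeline itself: the temporary directory, the FASTQ file names, and the checksum file name `checksums.md5` are all hypothetical stand-ins for your transferred data.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Demo directory standing in for the transferred data (hypothetical file names).
workdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' > "$workdir/sample1_L001.fastq"
printf '@read2\nTTGA\n+\nIIII\n' > "$workdir/sample1_L002.fastq"

# On the source machine: record one checksum per file.
(cd "$workdir" && md5sum -- *.fastq > checksums.md5)

# After the transfer: verify every file against the recorded checksums.
# md5sum -c exits non-zero if any file is missing or differs.
if (cd "$workdir" && md5sum -c --quiet checksums.md5); then
    transfer_status="OK"
else
    transfer_status="FAILED"
fi
echo "transfer check: $transfer_status"
```

In practice the checksum file would be created where the data originates and copied along with it, so that only the `md5sum -c` step runs on /work/LAS/your-lab.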
Select the pre-built indexed database for your species in:¶
or use your own indexed database.
The pre-built indexed databases are organized as follows:
Species: commonly analyzed ones. Other species can be added upon request.
DataSource: NCBI, Ensembl and UCSC
BuildVersion: last three builds are online
Pre-indexed Database: BWAIndex, Bowtie2Index, BowtieIndex, GmapIndex
The annotation file is in:
Edit proj.cfg for RNA-Seq pipeline¶
The sample data is for Glycine max, sequenced on 3 lanes per sample; the lanes are combined during alignment with gsnap or tophat.
The annotation is Gm01 from Ensembl. After alignment, the raw reads are counted using HTseq-count, and FPKM is calculated by cufflinks.
The tab-delimited FPKM and raw count summary files for downstream statistical analysis are created using the scripts summary_fpkm_ct.sh and summary_htseq_ct.sh.
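As an illustration of combining the lanes of one sample at alignment time, the sketch below collects a sample's lane files into the comma-separated list that TopHat accepts for multiple read files. The `<sample>_L00<lane>.fastq` naming scheme, the sample names, and the helper `lanes_for` are assumptions made for this example; the pipeline's own alignment scripts may organize this differently.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical layout: one FASTQ per lane, named <sample>_L00<lane>.fastq.
datadir=$(mktemp -d)
for sample in S1 S2; do
    for lane in 1 2 3; do
        : > "$datadir/${sample}_L00${lane}.fastq"
    done
done

# Join the lane files of one sample into a comma-separated list.
lanes_for() {
    local sample=$1
    local files=("$datadir/${sample}"_L00*.fastq)
    local IFS=,
    echo "${files[*]}"
}

s1_reads=$(lanes_for S1)
echo "S1 reads: $s1_reads"
```

The resulting string can be passed as the reads argument of a single alignment run, so that all three lanes end up in one alignment output per sample.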
Example: RNA-Seq analysis using GSNAP, HTseq-count, and cufflinks.
to your home. Edit the project configuration file:
The meaning of the parameters is self-explanatory.
    cd gsnap-cufflinks-htseq/aln
    # check gsnap_align_.sh
    qsub parallel-aln.qsub
Then submit parallel-HTseq_ct.qsub.
The scripts summary_htseq_ct.sh and summary_htseq_ct_run.sh are for making the raw count summary file.
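To show what a raw count summary file might look like, here is a minimal sketch that merges two per-sample count files into one tab-delimited table. It is not the content of summary_htseq_ct.sh; the file names and counts are made up, and it assumes the two-column, tab-separated output of htseq-count (gene ID, count) with the same genes in the same order in every file.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Demo inputs standing in for per-sample htseq-count output
# (two tab-separated columns: gene ID, count). File names are hypothetical.
outdir=$(mktemp -d)
printf 'geneA\t10\ngeneB\t0\n__no_feature\t5\n' > "$outdir/S1.counts.txt"
printf 'geneA\t7\ngeneB\t3\n__no_feature\t2\n'  > "$outdir/S2.counts.txt"

summary="$outdir/raw_count_summary.txt"

# Header row: "gene" followed by one column per sample.
printf 'gene\tS1\tS2\n' > "$summary"

# Place the files side by side and keep gene ID plus both count columns,
# skipping htseq-count's "__" summary rows.
# Assumes every file lists the same genes in the same order.
paste "$outdir/S1.counts.txt" "$outdir/S2.counts.txt" |
    awk -F'\t' -v OFS='\t' '$1 !~ /^__/ { print $1, $2, $4 }' >> "$summary"

cat "$summary"
```

A table of this shape, with one row per gene and one column per sample, is the usual input for downstream count-based statistical tools.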
Then submit parallel-cufflinks.qsub to run the job.
The scripts summary_fpkm_ct.sh and summary_fpkm_ct_run.sh are for making the tab-delimited FPKM summary file.
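For orientation, the sketch below extracts the gene ID and FPKM columns from a single cufflinks genes.fpkm_tracking file. It is an illustration only and not the content of summary_fpkm_ct.sh: the demo rows and values are made up, and the sample label `S1` is hypothetical. The FPKM column is located by its header name rather than by position.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Demo input standing in for one sample's cufflinks genes.fpkm_tracking file.
# All values are made up.
outdir=$(mktemp -d)
{
    printf 'tracking_id\tclass_code\tnearest_ref_id\tgene_id\tgene_short_name\ttss_id\tlocus\tlength\tcoverage\tFPKM\tFPKM_conf_lo\tFPKM_conf_hi\tFPKM_status\n'
    printf 'geneA\t-\t-\tgeneA\t-\t-\tGm01:100-900\t-\t-\t12.5\t10.1\t14.9\tOK\n'
    printf 'geneB\t-\t-\tgeneB\t-\t-\tGm01:2000-3500\t-\t-\t0\t0\t0\tOK\n'
} > "$outdir/genes.fpkm_tracking"

# Find the FPKM column by its header name, then print gene ID and FPKM.
awk -F'\t' -v OFS='\t' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "FPKM") col = i
              print "gene", "S1"; next }
    { print $1, $col }
' "$outdir/genes.fpkm_tracking" > "$outdir/S1.fpkm.txt"

cat "$outdir/S1.fpkm.txt"
```

Repeating this per sample and joining the resulting columns on the gene ID gives a tab-delimited FPKM table with one column per sample.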