RNA-seq Analysis Using Old Sequgio (deprecated)

Welcome


This site describes RNA-seq analysis using TopHat, Cufflinks, and Sequgio. Most examples are run on the UPPMAX (http://www.uppmax.uu.se/) facilities.

Data Preparation


The following data/files should be provided:
  • FASTQ files
  • Human reference genome (FASTA file)
  • Human reference genome annotation (B37 from Ensembl or hg19) (GTF file)
Useful information about the RNA-seq pipeline can be found here: http://nestor.uppnex.se/twiki/bin/view/Courses/CM1209/TranscriptomeMappingFirst
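Before submitting any jobs, it is worth checking that these inputs are in place. A minimal R sketch (the paths are the example paths used later in this guide; adjust to your own layout):

## Verify the required input files exist; stops with an error if anything is missing
inputs <- c(fastq  = "INBOX/BRCA/batch1/SRR327626_1.fastq.gz",
            genome = "reference/BowtieIndex/genome.fa",
            gtf    = "reference/genes.gtf")
stopifnot(all(file.exists(inputs)))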

Alignment


For alignment we use TopHat, which aligns RNA-Seq reads to a genome in order to identify exon-exon splice junctions. It is built on the ultrafast short-read mapping program Bowtie. The TopHat manual can be found here: http://tophat.cbcb.umd.edu/manual.shtml. It is highly recommended to read the manual before running the following examples.
Using TopHat:
 
Example of a shell script:
#!/bin/bash -l
#SBATCH -A b2012036
#SBATCH -p node -n 8
#SBATCH -t 40:00:00
#SBATCH --mail-user=user@ki.se
#SBATCH --mail-type=ALL
#SBATCH -J tophat
module load bioinfo-tools
module load tophat/1.4.0
tophat -o INBOX/BRCA/batch1/tophat.output.SRR327626 -p 8 --no-novel-juncs --library-type=fr-unstranded -G reference/genes.gtf reference/BowtieIndex/genome INBOX/BRCA/batch1/SRR327626_1.fastq.gz INBOX/BRCA/batch1/SRR327626_2.fastq.gz
 
 
Expression Quantification 

Cufflinks 

Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples. It accepts aligned RNA-Seq reads and assembles the alignments into a parsimonious set of transcripts. Cufflinks then estimates the relative abundances of these transcripts based on how many reads support each one, taking into account biases in library preparation protocols. More details and the Cufflinks user manual can be found here: http://cufflinks.cbcb.umd.edu/
 
Using Cufflinks:
module load bioinfo-tools
module load cufflinks/2.0.2
cufflinks -o INBOX/BRCA/batch1/Cuffres.SRR327626 -p 8 -G reference/genes.gtf -b reference/BowtieIndex/genome.fa INBOX/BRCA/batch1/tophat.output.SRR327626/accepted_hits.bam
The -G option makes Cufflinks quantify only the transcripts in the supplied GTF; no novel transcripts are assembled.
 
TopHat and Cufflinks can also be run back-to-back in a single job. Example for batch 7 (file: topcuf_batch7sc.txt):
#!/bin/bash -l
#SBATCH -A b2012036
#SBATCH -p node -n 8
#SBATCH -t 60:00:00
#SBATCH --mail-user=user@ki.se
#SBATCH --mail-type=ALL
#SBATCH -J TCB7sc
 
module load bioinfo-tools
 
module load tophat/1.4.0 
 
module load cufflinks/2.0.2
 
tophat -o /scratch/tophat.outputsc.$1 \
 -p 8 --library-type=fr-unstranded \
 -G reference/genes.gtf reference/BowtieIndex/genome \
 INBOX/BRCA/batch7/$1_1.fastq.gz \
 INBOX/BRCA/batch7/$1_2.fastq.gz
 
 
cp -r /scratch/tophat.outputsc.$1 INBOX/BRCA/batch7/
 
 
cufflinks -o /scratch/CuffresGsc.$1 \
 -p 8 -G reference/genes.gtf \
 -b reference/BowtieIndex/genome.fa \
 INBOX/BRCA/batch7/tophat.outputsc.$1/accepted_hits.bam
 
 
 
cp -r /scratch/CuffresGsc.$1 INBOX/BRCA/batch7/
The sample name is passed as the first argument ($1) when submitting the job:

sbatch topcuf_batch7sc.txt SRR327626

 
Note that the output of each step (TopHat and Cufflinks) is stored temporarily on the node-local disk (/scratch) before being copied to the project directory. An explanation of SCRATCH is available here: http://www.uppmax.uu.se/disk-storage-guide
To submit all samples, we use R to issue one sbatch call per sample:
 
#################################
## Run Tophat and Cufflinks   ###
## allowing novel transcripts ###
##  BATCH 7 ###
#################################      
 
setwd('/lynx/cvol/v25/b2012036/INBOX/BRCA/')

## Collect the sample (SRR) names from the fastq file names
f.long <- dir(recursive=TRUE)
ffastq <- f.long[grep(".fastq.gz", f.long)]
SRR.batch7 <- unique(substr(ffastq, 1, 9))
write.csv2(SRR.batch7, file="SRR.batch7.csv")

## Submit one TopHat+Cufflinks job per sample
topcufMC7sc <- function(f){
  cmd <- paste('sbatch topcuf_batch7sc.txt', f, sep=' ')
  system(cmd)
}

for (i in 1:length(SRR.batch7)) topcufMC7sc(SRR.batch7[i])
More R code is in: run paralel.R
Sequgio

Getting TXDB (runReshape.R)
library(Sequgio)
## Load the Ensembl GRCh37.69 annotation and reshape it for Sequgio
dbfile <- "GRCh37.69.sqlite"
mybio <- loadDb(dbfile)
mparam <- MulticoreParam(8)
txdb <- reshapeTxDb(mybio, probelen=50L, with.junctions=TRUE, mcpar=mparam)
save(txdb, file="txdb37.RData")
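A quick sanity check can be run on the reshaped object before moving on. This is only a sketch: it assumes, as the later steps in this guide do, that reshapeTxDb returns a GRangesList-like object with exon metadata in txdb@unlistData.

class(txdb)                      # expected: a GRangesList-like object
length(txdb)                     # number of regions
head(values(txdb@unlistData))    # exon-level metadata used by later steps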
Make design matrix (makeDesign.R):
library(Sequgio)
load("txdb37.RData")   # the object saved by runReshape.R
mparam <- MulticoreParam(16)
attr(txdb, "probelen") <- 50L
Design <- makeXmatrix(txdb, method="PE", mulen=200, sdlen=80, mcpar=mparam)
save(Design, file="Design.RData")
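Here mulen=200 and sdlen=80 are taken to be the mean and standard deviation of the paired-end fragment length; that interpretation is our assumption. If these values are unknown for your library, a rough estimate can be obtained from an aligned BAM, e.g. with Rsamtools:

## Sketch: estimate fragment-length mean/sd from properly paired reads
## (assumes mulen/sdlen in makeXmatrix are the fragment-length mean and sd)
library(Rsamtools)
param <- ScanBamParam(what="isize", flag=scanBamFlag(isProperPair=TRUE))
isz <- abs(scanBam("accepted_hits.bam", param=param)[[1]]$isize)
isz <- isz[isz > 0 & isz < 1000]   # discard artefacts and distant pairs
c(mulen=mean(isz), sdlen=sd(isz))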
* For each BAM file, fix the QNAME. Different types of header have different regex parameters (-r and -s) for fixQNAME.py. For example: for header patterns UNC9-SN296_240:1:1101:10000:104941/1 and UNC9-SN296_240:1:1101:10000:104941/2, use regex -r "/\d+$" -s "";
for header patterns SRR039629.1000004 and SRR039628.1000004, use regex -r "(SRR)(\d+)(\.\d+)$" -s "\g<1>12829\g<3>" (both substitutions are illustrated in the R sketch after the commands below).
 
 
export PYTHONPATH=/home/dhany/pysam-0.7.5/lib64/python2.6/site-packages/
cd /home/dhany/pysam-0.7.5/
python setup.py install --prefix /home/dhany/pysam-0.7.5
date; python /home/dhany/fixQNAME.py -i yourfile.bam -o yourfile.fixed.bam -r "/\d+$" -s ""; date
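The two substitutions above can be checked in R before running fixQNAME.py on a large BAM (a sketch; fixQNAME.py applies the same pattern to every read name):

sub("/\\d+$", "", "UNC9-SN296_240:1:1101:10000:104941/1")
## "UNC9-SN296_240:1:1101:10000:104941"
sub("(SRR)(\\d+)(\\.\\d+)$", "\\112829\\3", "SRR039629.1000004")
## "SRR12829.1000004"  (\\1 is a backreference, 12829 is literal text)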

Get counts (runGetcountsBatch7.R):

library(Sequgio)
mparam <- MulticoreParam(8)
load("txdb37.RData")

## Locate the per-sample TopHat output directories
loc <- "/pica/h1/setia/BRCA/INBOX/BRCA/batch7/"
files <- dir(path=loc)
files <- files[grep("tophat.outputsc", files)]

## The sample index i is passed on the command line as '--args i=<n>'
args <- commandArgs(TRUE)
if (length(args) == 0)
    stop("No sample index supplied.")
eval(parse(text=args[[1]]))

## Extract the sample (SRR) name from the directory name
samples <- substr(files[i], 17, 25)
samples

target <- data.frame(
    filenames   = paste(loc, "tophat.outputsc.", samples, "/accepted_hits.bam", sep=""),
    samplenames = samples,
    index       = paste(loc, "tophat.outputsc.", samples, "/accepted_hits.bam.bai", sep=""),
    stringsAsFactors = FALSE)

## Count reads per exon pair, keeping alignments with MAPQ >= 30
allCounts.bigM <- getCounts(target, txdb, mcpar=mparam, mapq.filter=30, use.samtools=TRUE)
allCounts <- as.matrix(allCounts.bigM[, , drop=FALSE])
save(allCounts, file=paste("/bubo/home/h1/setia/BRCA/Sequgio/batch7/allCounts.", samples, ".RData", sep=""))
The script takes a sample index i, which is supplied through the submission script (Getcountbatch7.txt):
#!/bin/bash -l
#SBATCH -A b2012036
#SBATCH -p node -n 8
#SBATCH -t 5:00:00
#SBATCH -C mem72GB
#SBATCH --mail-user=setia.pramana@ki.se
#SBATCH --mail-type=ALL
#SBATCH -J CountB7
module load bioinfo-tools
module load GATK
module load samtools/0.1.18
R < runGetcountsBatch7.R --no-save $1
For example, for the first sample we can run:
sbatch Getcountbatch7.txt '--args i=1'
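To see how '--args i=1' reaches the script: commandArgs(TRUE) returns everything after --args, here the single string "i=1", and eval(parse(text=...)) turns it into an assignment. A minimal illustration:

args <- "i=1"               # what commandArgs(TRUE) returns inside the job
eval(parse(text=args[[1]]))
i                           # [1] 1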
To submit multiple samples, use the following sbatch script (MultsubmitBatch7.txt):
#!/bin/bash  -l
#SBATCH -A b2012036
#SBATCH -p core -n 5
#SBATCH -t 15:00 --qos=short
#SBATCH -J submit
#SBATCH --mail-user=setia.pramana@ki.se
#SBATCH --mail-type=ALL
##################
p="--args"
u=" i="
v="sbatch Getcountbatch7.txt"
# sbatch MultsubmitBatch7.txt #
for i in {1..40}
do
echo $v \'$p$u$i\'
eval $v \'$p$u$i\'
done
Note that 40 is the number of samples in that batch.
sbatch MultsubmitBatch7.txt
 
 
 
Model fitting (runGetcountsBatch7.R):
library(Sequgio)
setwd('/pica/h1/setia/BRCA/INBOX/Sequgio_TCGA/')
load('/pica/h1/setia/BRCA/INBOX/BRCA/txdb37.RData')

## Combine the per-sample count files into one matrix
loc <- "/pica/h1/setia/BRCA/INBOX/Sequgio_TCGA/batch7/"
files <- dir(path=loc)
files <- files[grep("allCounts", files)]
allCountsMat <- NULL
for (i in 1:length(files)) {
  load(paste(loc, files[i], sep=""))
  allCountsMat <- cbind(allCountsMat, allCounts)
  rm(allCounts)
  cat(i)
}

load('/pica/h1/setia/BRCA/Design37.RData')

## Fit Models ##
library(parallel)
## Name each design matrix by its gene, then fit one model per gene
gNames <- sapply(Design, function(x) strsplit(attributes(x)$dimnames[[2]][1], '__')[[1]][2])
names(gNames) <- gNames
names(Design) <- gNames
Thetas <- mclapply(gNames, fitModels, design=Design, counts=allCountsMat,
                   maxit=20, verbose=TRUE, useC=FALSE, ls.start=FALSE, Q1=0.9)
save(Thetas, file='Thetas.batch7.RData')
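A quick look at the fitted objects (a sketch; the exact structure of each element depends on fitModels and may differ between Sequgio versions):

load('Thetas.batch7.RData')
length(Thetas)      # one element per gene
str(Thetas[[1]])    # inspect the first fitted model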
The Sequgio results are located in:
/lynx/cvol/v25/b2012036/INBOX/Sequgio_TCGA
Sequgio with Python (alternative way)

 
 
Creating TXDB (the reshapeTxDb call took 9 hours using 8 cores for the GRCh37 reference):
library(Sequgio)
library(RSQLite)
dbfile <- "/proj/b2012036/GRCh37.69.sqlite"
mybio <- loadDb(dbfile)
mparam <- MulticoreParam(8)
txdb <- reshapeTxDb(mybio, probelen=50L, with.junctions=TRUE, mcpar=mparam)
save(txdb, file="/proj/b2012036/Dhany/Sequgio/txdbgrch.RData")
## Export the exon table and load it into an SQLite database for the Python scripts
write.table(as.data.frame(txdb@unlistData), "/proj/b2012036/Dhany/Sequgio/txdb.sql", sep="\t")
db <- dbConnect(SQLite(), dbname="/proj/b2012036/Dhany/Sequgio/grch3769.sqlite")
dbWriteTable(conn=db, name="humangenome", value="txdb.sql", row.names=FALSE, header=FALSE, sep="\t")
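An optional sanity check (sketch) to confirm the table was written, using the table name created above:

library(RSQLite)
db <- dbConnect(SQLite(), dbname="/proj/b2012036/Dhany/Sequgio/grch3769.sqlite")
dbListTables(db)                                    # should include "humangenome"
dbGetQuery(db, "SELECT COUNT(*) FROM humangenome")  # number of exon rows
dbDisconnect(db)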
Make design matrix (the makeXmatrix call took 7 hours 40 min using 16 cores for the GRCh37 reference):
library(Sequgio)
load("/proj/b2012036/Dhany/Sequgio/txdbgrch.RData")
mparam <- MulticoreParam(16)
attr(txdb,"probelen") = 50L
Design <- makeXmatrix(txdb,method="PE",mulen=200,sdlen=80, mcpar=mparam)
save(Design, file="/proj/b2012036/Dhany/Sequgio/DesignGrch.RData")
Get Python counts (run in bash, 8 nodes). For a 6 GB BAM file: 28 min preprocessing (6 min filtering + 10 min sorting + 12 min sorting multiple alignments), 1.5 min separating into chromosomes (lines 2-5), and 3 min for the Python getCounts (lines 6-9).
PS: You need to download preprocess.sh and getPairCount.py. You can experiment with values other than parallel=12 and other than 8 nodes to find a faster configuration for your setting.
cd /proj/b2012036/INBOX/Dhany/newdata/tophat.outputsc.SRR328008/
bash /proj/b2012036/Dhany/Sequgio/preprocess.sh file=accepted_hits.bam
for (( i=1; i<=22; i++ )); do awk -v j=$i '{ if($3==j) print $0 }' accepted_hits.bam.sortMA > accepted_hits.$i & done
awk '{ if($3=="X") print $0 }' accepted_hits.bam.sortMA > accepted_hits.X &
awk '{ if($3=="Y") print $0 }' accepted_hits.bam.sortMA > accepted_hits.Y &
wait
for (( i=1; i<=22; i++ )); do python /proj/b2012036/Dhany/Sequgio/getPairCount.py accepted_hits output.$i /proj/b2012036/Dhany/Sequgio/grch3769.sqlite $i & done
python /proj/b2012036/Dhany/Sequgio/getPairCount.py accepted_hits output.X /proj/b2012036/Dhany/Sequgio/grch3769.sqlite X &
python /proj/b2012036/Dhany/Sequgio/getPairCount.py accepted_hits output.Y /proj/b2012036/Dhany/Sequgio/grch3769.sqlite Y &
wait
cat output.* > outputfinal.txt
rm -f output.*
for (( i=1; i<=22; i++ )); do rm -f accepted_hits.$i & done
rm -f accepted_hits.X
rm -f accepted_hits.Y
Writing list of possible exon pairs (in R)
library(Sequgio)
load("txdbgrch.RData")
ex_list <- split(values(txdb@unlistData)$exon_name, values(txdb@unlistData)$tx_name)
reg_vec <- sapply(split(values(txdb@unlistData)$region_id, values(txdb@unlistData)$tx_name), function(x) x[1])
## A transcript with s exons yields s*(s+1)/2 exon pairs (self-pairs included)
sizes_ex_list <- sapply(ex_list, length)
n.exons <- sum((sizes_ex_list^2 + sizes_ex_list)/2)
exons.names <- unique(.Call("makeExNames", ex_list, reg_vec, as.integer(n.exons)))
write.table(exons.names, "exons19.txt", row.names=FALSE, col.names=FALSE, quote=FALSE, sep="\t")
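A small sanity check (sketch): the number of unique pair names cannot exceed the theoretical count, and the file should read back cleanly:

length(exons.names) <= n.exons   # TRUE; unique() may remove duplicates
head(readLines("exons19.txt"))   # first few exon-pair names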
 
 
Joining counts of different samples into one file (in bash). The arguments to makeAllcountMatrix.py are the exon-pair list, the number of samples, the per-sample count files, and the corresponding sample names:
python /proj/b2012036/Dhany/Sequgio/makeAllcountMatrix.py /proj/b2012036/Dhany/Sequgio/exons.txt 5 /proj/b2012036/INBOX/Dhany/newdata/tophat.output.SRR327626/output.txt /proj/b2012036/INBOX/Dhany/newdata/tophat.output.SRR327734/outputfinal.txt /proj/b2012036/INBOX/Dhany/newdata/tophat.output.SRR327735/outputfinal.txt /proj/b2012036/INBOX/Dhany/newdata/tophat.output.SRR327736/outputfinal.txt /proj/b2012036/INBOX/Dhany/newdata/tophat.output.SRR327737/outputfinal.txt SRR327626 SRR327734 SRR327735 SRR327736 SRR327737 > sample.counts;
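To continue in R, the joined table can be read back into a count matrix. This is only a sketch: the layout (exon-pair name in column 1, then one count column per sample, in the order the sample names were passed) is assumed from the command above.

## Read the joined counts into a matrix (column layout assumed, see above)
samples <- c("SRR327626", "SRR327734", "SRR327735", "SRR327736", "SRR327737")
allCountsMat <- as.matrix(read.table("sample.counts", header=FALSE,
                          row.names=1, col.names=c("exon", samples)))
dim(allCountsMat)   # exon pairs x 5 samples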
 
 
Model fitting:
 
Command:
bash /proj/b2012036/Dhany/Sequgio/batch_fastfitting.sh <size of Design matrix> <Design matrix.RData> <allCounts> <proj number> <emails for slurm status> <number of cores per job> <number of Design matrix size to process per batch> <your uppmax username>
Experiment with <number of cores per job> and <number of Design matrix size to process per batch> to get better performance.
 
bash /proj/b2012036/Dhany/Sequgio/batch_fastfitting.sh 31700 /proj/b2012036/Dhany/Sequgio/DesignGrch.RData /proj/b2012036/INBOX/Dhany/newdata/sample.count.grch b2012036 dhany.saputra@ki.se 4 500 dhany
cat myR.* > myfpkm.txt
Required Files