RNA-seq Analysis Using Old Sequgio (deprecated)

Welcome


This site describes RNA-seq analysis using TopHat, Cufflinks, and Sequgio. Most examples are run on the UPPMAX (http://www.uppmax.uu.se/) facilities.

Data Preparation


The following data/files should be provided:
  • FASTQ files
  • Human reference genome (FASTA file)
  • Human reference genome annotation (B37 from Ensembl or hg19) (GTF file)
Useful information about the RNA-seq pipeline can be found here: http://nestor.uppnex.se/twiki/bin/view/Courses/CM1209/TranscriptomeMappingFirst
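Before submitting any jobs, it is worth checking that these inputs are in place. A minimal R sketch (the paths are the example paths used later in this guide; adjust to your own layout):

## Verify the required input files exist; stops with an error if anything is missing
inputs <- c(fastq  = "INBOX/BRCA/batch1/SRR327626_1.fastq.gz",
            genome = "reference/BowtieIndex/genome.fa",
            gtf    = "reference/genes.gtf")
stopifnot(all(file.exists(inputs)))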

Alignment


For alignment we use TopHat, which aligns RNA-Seq reads to a genome in order to identify exon-exon splice junctions. It is built on the ultrafast short-read mapping program Bowtie. The TopHat manual can be found here: http://tophat.cbcb.umd.edu/manual.shtml. It is highly recommended to read the manual before running the following examples.
Using TopHat:
 
Example of a shell script:
#!/bin/bash -l
#SBATCH -A b2012036
#SBATCH -p node -n 8
#SBATCH -t 40:00:00
#SBATCH --mail-user=user@ki.se
#SBATCH --mail-type=ALL
#SBATCH -J tophat
module load bioinfo-tools
module load tophat/1.4.0
tophat -o INBOX/BRCA/batch1/tophat.output.SRR327626 -p 8 --no-novel-juncs --library-type=fr-unstranded -G reference/genes.gtf reference/BowtieIndex/genome INBOX/BRCA/batch1/SRR327626_1.fastq.gz INBOX/BRCA/batch1/SRR327626_2.fastq.gz
 
 
Expression Quantification 

Cufflinks 

Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples. It accepts aligned RNA-Seq reads and assembles the alignments into a parsimonious set of transcripts. Cufflinks then estimates the relative abundances of these transcripts based on how many reads support each one, taking into account biases in library preparation protocols. More details and the Cufflinks user manual can be found here: http://cufflinks.cbcb.umd.edu/
 
Using Cufflinks:
module load bioinfo-tools
module load cufflinks/2.0.2
cufflinks -o INBOX/BRCA/batch1/Cuffres.SRR327626 -p 8 -G reference/genes.gtf -b reference/BowtieIndex/genome.fa INBOX/BRCA/batch1/tophat.output.SRR327626/accepted_hits.bam
The -G option makes Cufflinks quantify only the transcripts in the supplied GTF; no novel transcripts are assembled.
 
TopHat and Cufflinks can also be run back-to-back in a single job. Example for batch 7 (file: topcuf_batch7sc.txt):
#!/bin/bash -l
#SBATCH -A b2012036
#SBATCH -p node -n 8
#SBATCH -t 60:00:00
#SBATCH --mail-user=user@ki.se
#SBATCH --mail-type=ALL
#SBATCH -J TCB7sc
 
module load bioinfo-tools
 
module load tophat/1.4.0 
 
module load cufflinks/2.0.2
 
tophat -o /scratch/tophat.outputsc.$1 \
 -p 8 --library-type=fr-unstranded \
 -G reference/genes.gtf reference/BowtieIndex/genome \
 INBOX/BRCA/batch7/$1_1.fastq.gz \
 INBOX/BRCA/batch7/$1_2.fastq.gz
 
 
cp -r /scratch/tophat.outputsc.$1 INBOX/BRCA/batch7/
 
 
cufflinks -o /scratch/CuffresGsc.$1 \
 -p 8 -G reference/genes.gtf \
 -b reference/BowtieIndex/genome.fa \
 INBOX/BRCA/batch7/tophat.outputsc.$1/accepted_hits.bam
 
 
 
cp -r /scratch/CuffresGsc.$1 INBOX/BRCA/batch7/
The sample name is passed as the first argument ($1) when submitting the job:

sbatch topcuf_batch7sc.txt SRR327626

 
Note that the output of each step (TopHat and Cufflinks) is stored temporarily on the node-local disk (/scratch) before being copied to the project directory. An explanation of SCRATCH is available here: http://www.uppmax.uu.se/disk-storage-guide
To submit all samples, we use R to issue one sbatch call per sample:
 
#################################
## Run Tophat and Cufflinks   ###
## allowing novel transcripts ###
##  BATCH 7 ###
#################################      
 
setwd('/lynx/cvol/v25/b2012036/INBOX/BRCA/')

## Collect the sample (SRR) names from the fastq file names
f.long <- dir(recursive=TRUE)
ffastq <- f.long[grep(".fastq.gz", f.long)]
SRR.batch7 <- unique(substr(ffastq, 1, 9))
write.csv2(SRR.batch7, file="SRR.batch7.csv")

## Submit one TopHat+Cufflinks job per sample
topcufMC7sc <- function(f){
  cmd <- paste('sbatch topcuf_batch7sc.txt', f, sep=' ')
  system(cmd)
}

for (i in 1:length(SRR.batch7)) topcufMC7sc(SRR.batch7[i])
More R code is in: run paralel.R
Sequgio

Getting TXDB (runReshape.R)
library(Sequgio)
## Load the Ensembl GRCh37.69 annotation and reshape it for Sequgio
dbfile <- "GRCh37.69.sqlite"
mybio <- loadDb(dbfile)
mparam <- MulticoreParam(8)
txdb <- reshapeTxDb(mybio, probelen=50L, with.junctions=TRUE, mcpar=mparam)
save(txdb, file="txdb37.RData")
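A quick sanity check can be run on the reshaped object before moving on. This is only a sketch: it assumes, as the later steps in this guide do, that reshapeTxDb returns a GRangesList-like object with exon metadata in txdb@unlistData.

class(txdb)                      # expected: a GRangesList-like object
length(txdb)                     # number of regions
head(values(txdb@unlistData))    # exon-level metadata used by later steps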
Make design matrix (makeDesign.R):
library(Sequgio)
load("txdb37.RData")   # the object saved by runReshape.R
mparam <- MulticoreParam(16)
attr(txdb, "probelen") <- 50L
Design <- makeXmatrix(txdb, method="PE", mulen=200, sdlen=80, mcpar=mparam)
save(Design, file="Design.RData")
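Here mulen=200 and sdlen=80 are taken to be the mean and standard deviation of the paired-end fragment length; that interpretation is our assumption. If these values are unknown for your library, a rough estimate can be obtained from an aligned BAM, e.g. with Rsamtools:

## Sketch: estimate fragment-length mean/sd from properly paired reads
## (assumes mulen/sdlen in makeXmatrix are the fragment-length mean and sd)
library(Rsamtools)
param <- ScanBamParam(what="isize", flag=scanBamFlag(isProperPair=TRUE))
isz <- abs(scanBam("accepted_hits.bam", param=param)[[1]]$isize)
isz <- isz[isz > 0 & isz < 1000]   # discard artefacts and distant pairs
c(mulen=mean(isz), sdlen=sd(isz))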
* For each BAM file, fix the QNAME. Different types of header have different regex parameters (-r and -s) for fixQNAME.py. For example: for header patterns UNC9-SN296_240:1:1101:10000:104941/1 and UNC9-SN296_240:1:1101:10000:104941/2, use regex -r "/\d+$" -s "";
for header patterns SRR039629.1000004 and SRR039628.1000004, use regex -r "(SRR)(\d+)(\.\d+)$" -s "\g<1>12829\g<3>" (both substitutions are illustrated in the R sketch after the commands below).
 
 
export PYTHONPATH=/home/dhany/pysam-0.7.5/lib64/python2.6/site-packages/
cd /home/dhany/pysam-0.7.5/
python setup.py install --prefix /home/dhany/pysam-0.7.5
date; python /home/dhany/fixQNAME.py -i yourfile.bam -o yourfile.fixed.bam -r "/\d+$" -s ""; date
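The two substitutions above can be checked in R before running fixQNAME.py on a large BAM (a sketch; fixQNAME.py applies the same pattern to every read name):

sub("/\\d+$", "", "UNC9-SN296_240:1:1101:10000:104941/1")
## "UNC9-SN296_240:1:1101:10000:104941"
sub("(SRR)(\\d+)(\\.\\d+)$", "\\112829\\3", "SRR039629.1000004")
## "SRR12829.1000004"  (\\1 is a backreference, 12829 is literal text)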

Get counts (runGetcountsBatch7.R):

library(Sequgio)
mparam <- MulticoreParam(8)
load("txdb37.RData")

## Locate the per-sample TopHat output directories
loc <- "/pica/h1/setia/BRCA/INBOX/BRCA/batch7/"
files <- dir(path=loc)
files <- files[grep("tophat.outputsc", files)]

## The sample index i is passed on the command line as '--args i=<n>'
args <- commandArgs(TRUE)
if (length(args) == 0)
    stop("No sample index supplied.")
eval(parse(text=args[[1]]))

## Extract the sample (SRR) name from the directory name
samples <- substr(files[i], 17, 25)
samples

target <- data.frame(
    filenames   = paste(loc, "tophat.outputsc.", samples, "/accepted_hits.bam", sep=""),
    samplenames = samples,
    index       = paste(loc, "tophat.outputsc.", samples, "/accepted_hits.bam.bai", sep=""),
    stringsAsFactors = FALSE)

## Count reads per exon pair, keeping alignments with MAPQ >= 30
allCounts.bigM <- getCounts(target, txdb, mcpar=mparam, mapq.filter=30, use.samtools=TRUE)
allCounts <- as.matrix(allCounts.bigM[, , drop=FALSE])
save(allCounts, file=paste("/bubo/home/h1/setia/BRCA/Sequgio/batch7/allCounts.", samples, ".RData", sep=""))
The script takes a sample index i, which is supplied through the submission script (Getcountbatch7.txt):
#!/bin/bash -l
#SBATCH -A b2012036
#SBATCH -p node -n 8
#SBATCH -t 5:00:00
#SBATCH -C mem72GB
#SBATCH --mail-user=setia.pramana@ki.se
#SBATCH --mail-type=ALL
#SBATCH -J CountB7
module load bioinfo-tools
module load GATK
module load samtools/0.1.18
R < runGetcountsBatch7.R --no-save $1
For example, for the first sample we can run:
sbatch Getcountbatch7.txt '--args i=1'
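To see how '--args i=1' reaches the script: commandArgs(TRUE) returns everything after --args, here the single string "i=1", and eval(parse(text=...)) turns it into an assignment. A minimal illustration:

args <- "i=1"               # what commandArgs(TRUE) returns inside the job
eval(parse(text=args[[1]]))
i                           # [1] 1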
To submit multiple samples, use the following sbatch script (MultsubmitBatch7.txt):
#!/bin/bash  -l
#SBATCH -A b2012036
#SBATCH -p core -n 5
#SBATCH -t 15:00 --qos=short
#SBATCH -J submit
#SBATCH --mail-user=setia.pramana@ki.se
#SBATCH --mail-type=ALL
##################
p="--args"
u=" i="
v="sbatch Getcountbatch7.txt"
# sbatch MultsubmitBatch7.txt #
for i in {1..40}
do
echo $v \'$p$u$i\'
eval $v \'$p$u$i\'
done
Note that 40 is the number of samples in that batch.
sbatch MultsubmitBatch7.txt
 
 
 
Model fitting (runGetcountsBatch7.R):
library(Sequgio)
setwd('/pica/h1/setia/BRCA/INBOX/Sequgio_TCGA/')
load('/pica/h1/setia/BRCA/INBOX/BRCA/txdb37.RData')

## Combine the per-sample count files into one matrix
loc <- "/pica/h1/setia/BRCA/INBOX/Sequgio_TCGA/batch7/"
files <- dir(path=loc)
files <- files[grep("allCounts", files)]
allCountsMat <- NULL
for (i in 1:length(files)) {
  load(paste(loc, files[i], sep=""))
  allCountsMat <- cbind(allCountsMat, allCounts)
  rm(allCounts)
  cat(i)
}

load('/pica/h1/setia/BRCA/Design37.RData')

## Fit Models ##
library(parallel)
## Name each design matrix by its gene, then fit one model per gene
gNames <- sapply(Design, function(x) strsplit(attributes(x)$dimnames[[2]][1], '__')[[1]][2])
names(gNames) <- gNames
names(Design) <- gNames
Thetas <- mclapply(gNames, fitModels, design=Design, counts=allCountsMat,
                   maxit=20, verbose=TRUE, useC=FALSE, ls.start=FALSE, Q1=0.9)
save(Thetas, file='Thetas.batch7.RData')
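A quick look at the fitted objects (a sketch; the exact structure of each element depends on fitModels and may differ between Sequgio versions):

load('Thetas.batch7.RData')
length(Thetas)      # one element per gene
str(Thetas[[1]])    # inspect the first fitted model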
The Sequgio results are located in:
/lynx/cvol/v25/b2012036/INBOX/Sequgio_TCGA
Sequgio with Python (alternative way)

 
 
Creating TXDB (the reshapeTxDb call took 9 hours using 8 cores for the GRCh37 reference):
library(Sequgio)
library(RSQLite)
dbfile <- "/proj/b2012036/GRCh37.69.sqlite"
mybio <- loadDb(dbfile)
mparam <- MulticoreParam(8)
txdb <- reshapeTxDb(mybio, probelen=50L, with.junctions=TRUE, mcpar=mparam)
save(txdb, file="/proj/b2012036/Dhany/Sequgio/txdbgrch.RData")
## Export the exon table and load it into an SQLite database for the Python scripts
write.table(as.data.frame(txdb@unlistData), "/proj/b2012036/Dhany/Sequgio/txdb.sql", sep="\t")
db <- dbConnect(SQLite(), dbname="/proj/b2012036/Dhany/Sequgio/grch3769.sqlite")
dbWriteTable(conn=db, name="humangenome", value="txdb.sql", row.names=FALSE, header=FALSE, sep="\t")
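An optional sanity check (sketch) to confirm the table was written, using the table name created above:

library(RSQLite)
db <- dbConnect(SQLite(), dbname="/proj/b2012036/Dhany/Sequgio/grch3769.sqlite")
dbListTables(db)                                    # should include "humangenome"
dbGetQuery(db, "SELECT COUNT(*) FROM humangenome")  # number of exon rows
dbDisconnect(db)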
Make design matrix (the makeXmatrix call took 7 hours 40 min using 16 cores for the GRCh37 reference):
library(Sequgio)
load("/proj/b2012036/Dhany/Sequgio/txdbgrch.RData")
mparam <- MulticoreParam(16)
attr(txdb,"probelen") = 50L
Design <- makeXmatrix(txdb,method="PE",mulen=200,sdlen=80, mcpar=mparam)
save(Design, file="/proj/b2012036/Dhany/Sequgio/DesignGrch.RData")
Get Python counts (run in bash, 8 nodes). For a 6 GB BAM file: 28 min preprocessing (6 min filtering + 10 min sorting + 12 min sorting multiple alignments), 1.5 min separating into chromosomes (lines 2-5), and 3 min for the Python getCounts (lines 6-9).
PS: You need to download preprocess.sh and getPairCount.py. You can experiment with values other than parallel=12 and other than 8 nodes to find a faster configuration for your setting.
cd /proj/b2012036/INBOX/Dhany/newdata/tophat.outputsc.SRR328008/
bash /proj/b2012036/Dhany/Sequgio/preprocess.sh file=accepted_hits.bam
for (( i=1; i<=22; i++ )); do awk -v j=$i '{ if($3==j) print $0 }' accepted_hits.bam.sortMA > accepted_hits.$i & done
awk '{ if($3=="X") print $0 }' accepted_hits.bam.sortMA > accepted_hits.X &
awk '{ if($3=="Y") print $0 }' accepted_hits.bam.sortMA > accepted_hits.Y &
wait
for (( i=1; i<=22; i++ )); do python /proj/b2012036/Dhany/Sequgio/getPairCount.py accepted_hits output.$i /proj/b2012036/Dhany/Sequgio/grch3769.sqlite $i & done
python /proj/b2012036/Dhany/Sequgio/getPairCount.py accepted_hits output.X /proj/b2012036/Dhany/Sequgio/grch3769.sqlite X &
python /proj/b2012036/Dhany/Sequgio/getPairCount.py accepted_hits output.Y /proj/b2012036/Dhany/Sequgio/grch3769.sqlite Y &
wait
cat output.* > outputfinal.txt
rm -f output.*
for (( i=1; i<=22; i++ )); do rm -f accepted_hits.$i & done
rm -f accepted_hits.X
rm -f accepted_hits.Y
Writing list of possible exon pairs (in R)
library(Sequgio)
load("txdbgrch.RData")
ex_list <- split(values(txdb@unlistData)$exon_name, values(txdb@unlistData)$tx_name)
reg_vec <- sapply(split(values(txdb@unlistData)$region_id, values(txdb@unlistData)$tx_name), function(x) x[1])
## A transcript with s exons yields s*(s+1)/2 exon pairs (self-pairs included)
sizes_ex_list <- sapply(ex_list, length)
n.exons <- sum((sizes_ex_list^2 + sizes_ex_list)/2)
exons.names <- unique(.Call("makeExNames", ex_list, reg_vec, as.integer(n.exons)))
write.table(exons.names, "exons19.txt", row.names=FALSE, col.names=FALSE, quote=FALSE, sep="\t")
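A small sanity check (sketch): the number of unique pair names cannot exceed the theoretical count, and the file should read back cleanly:

length(exons.names) <= n.exons   # TRUE; unique() may remove duplicates
head(readLines("exons19.txt"))   # first few exon-pair names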
 
 
Joining counts of different samples into one file (in bash). The arguments to makeAllcountMatrix.py are the exon-pair list, the number of samples, the per-sample count files, and the corresponding sample names:
python /proj/b2012036/Dhany/Sequgio/makeAllcountMatrix.py /proj/b2012036/Dhany/Sequgio/exons.txt 5 /proj/b2012036/INBOX/Dhany/newdata/tophat.output.SRR327626/output.txt /proj/b2012036/INBOX/Dhany/newdata/tophat.output.SRR327734/outputfinal.txt /proj/b2012036/INBOX/Dhany/newdata/tophat.output.SRR327735/outputfinal.txt /proj/b2012036/INBOX/Dhany/newdata/tophat.output.SRR327736/outputfinal.txt /proj/b2012036/INBOX/Dhany/newdata/tophat.output.SRR327737/outputfinal.txt SRR327626 SRR327734 SRR327735 SRR327736 SRR327737 > sample.counts;
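To continue in R, the joined table can be read back into a count matrix. This is only a sketch: the layout (exon-pair name in column 1, then one count column per sample, in the order the sample names were passed) is assumed from the command above.

## Read the joined counts into a matrix (column layout assumed, see above)
samples <- c("SRR327626", "SRR327734", "SRR327735", "SRR327736", "SRR327737")
allCountsMat <- as.matrix(read.table("sample.counts", header=FALSE,
                          row.names=1, col.names=c("exon", samples)))
dim(allCountsMat)   # exon pairs x 5 samples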
 
 
Model fitting:
 
Command:
bash /proj/b2012036/Dhany/Sequgio/batch_fastfitting.sh <size of Design matrix> <Design matrix.RData> <allCounts> <proj number> <emails for slurm status> <number of cores per job> <number of Design matrix size to process per batch> <your uppmax username>
Experiment with <number of cores per job> and <number of Design matrix size to process per batch> to get better performance.
 
bash /proj/b2012036/Dhany/Sequgio/batch_fastfitting.sh 31700 /proj/b2012036/Dhany/Sequgio/DesignGrch.RData /proj/b2012036/INBOX/Dhany/newdata/sample.count.grch b2012036 dhany.saputra@ki.se 4 500 dhany
cat myR.* > myfpkm.txt
Required Files