Category Archives: RNA-seq

XAEM

1. Introduction
2. Download and installation
3. XAEM: step by step instruction and explanation
3.1 Preparation for the annotation reference
3.2 Quantification of transcripts
4. A practical copy-paste example of running XAEM
5. Dataset for differential expression (DE) analysis

1. Introduction

This document shows how to use XAEM [Deng et al., 2019] to quantify isoform expression for multiple samples.

What is new in version 0.1.2

Improved speed and fixed a bug in building CRPs so that XAEM works with complex annotations such as GENCODE and Ensembl, which usually have >200,000 isoforms for hg38. The X-matrix for human Ensembl GRCh38.95 can be downloaded here: X_matrix.RData

What is new in version 0.1.1

Added standard errors for the estimates
Fixed a small bug when separating a CRP into more than one CRP due to H_thres
Fixed a small bug in function crpcount() that caused an error when there was only 1 CRP

Older versions

Code, data and instructions for most XAEM versions are available on the XAEM GitHub site
Webpage of XAEM version 0.1.1: click here to get there
Webpage of XAEM version 0.1.0: click here to get there

Software requirements for XAEM:

R version 3.3.0 or later with installed packages: foreach and doParallel
C++11 compliant compiler (g++ >= 4.7)
XAEM is currently tested in Linux OS environment

Annotation reference: XAEM requires a fasta file of transcript sequences and a gtf file of transcript annotation. XAEM supports all kinds of reference and annotation for any species.

The pre-built X-matrix for GRCh38.95 can be downloaded here: X_matrix.RData

In the XAEM paper, we use the UCSC hg19 annotation:

Download the sequences of transcripts: transcripts.fa.gz
Download the annotation of transcripts: genes_annotation.gtf.gz
Download the design matrix X of this annotation: X_matrix.RData (the X matrix is an essential object for bias correction and isoform quantification; see Section 3.1.2 for more details)

wget https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/XAEM_datasources/transcripts.fa.gz
gunzip transcripts.fa.gz
wget https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/XAEM_datasources/genes_annotation.gtf.gz
gunzip genes_annotation.gtf.gz
wget -O X_matrix.RData https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/2022/09/X_matrix.rdata --no-check-certificate
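
After downloading, a quick sanity check can catch a failed or partial download before indexing. The sketch below is not part of XAEM: the function name check_reference is hypothetical, and it only tests that the FASTA file starts with a ">" header line and that the first non-comment GTF line has 9 tab-separated columns.

```shell
# Hypothetical helper (not shipped with XAEM): sanity-check a transcript
# FASTA file and a GTF annotation file after download and gunzip.
check_reference() {
    local fasta=$1
    local gtf=$2
    local nfields

    # Both files must exist and be non-empty.
    [ -s "$fasta" ] || { echo "missing or empty: $fasta" >&2; return 1; }
    [ -s "$gtf" ]   || { echo "missing or empty: $gtf" >&2; return 1; }

    # A FASTA file starts with a '>' header line.
    head -n 1 "$fasta" | grep -q '^>' \
        || { echo "not FASTA: $fasta" >&2; return 1; }

    # The first non-comment GTF line has 9 tab-separated fields.
    nfields=$(grep -v '^#' "$gtf" | head -n 1 | awk -F'\t' '{ print NF }')
    [ "$nfields" = "9" ] \
        || { echo "not GTF (expected 9 columns): $gtf" >&2; return 1; }

    echo "reference files look OK"
}

# Example:
#   check_reference transcripts.fa genes_annotation.gtf
```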

2. Download and installation

If you use the binary version of XAEM (recommended):

Download the latest binary version from XAEM website:

wget https://github.com/WenjiangDeng/XAEM/releases/download/v0.1.2/XAEM-binary-0.1.2.tar.gz

Uncompress to folder

tar -xzvf XAEM-binary-0.1.2.tar.gz

Move to the XAEM_home directory and do the configuration for XAEM

cd XAEM-binary-0.1.2
bash configure.sh

Add paths of lib folder and bin folder to LD_LIBRARY_PATH and PATH

export LD_LIBRARY_PATH=/path/to/XAEM-binary-0.1.2/lib:$LD_LIBRARY_PATH
export PATH=/path/to/XAEM-binary-0.1.2/bin:$PATH

If you want to build XAEM from sources:

Download XAEM and move to XAEM_home directory

wget https://github.com/WenjiangDeng/XAEM/releases/download/v0.1.2/XAEM-source-0.1.2.tar.gz
tar -xzvf XAEM-source-0.1.2.tar.gz
cd XAEM-source-0.1.2
bash configure.sh

Building XAEM from source uses the build flags of Sailfish, including DFETCH_BOOST, DBOOST_ROOT, DTBB_INSTALL_DIR and DCMAKE_INSTALL_PREFIX. Please refer to the Sailfish website for more details on these flags.
Run the installation with the following command:

DBOOST_ROOT=/path/to/boostDir/ DTBB_INSTALL_DIR=/path/to/tbbDir/ DCMAKE_INSTALL_PREFIX=/path/to/expectedBuildDir bash install.sh

After the installation is finished, remember to add the paths of lib folder and bin folder to LD_LIBRARY_PATH and PATH

export LD_LIBRARY_PATH=/path/to/expectedBuildDir/lib:$LD_LIBRARY_PATH
export PATH=/path/to/expectedBuildDir/bin:$PATH

Do not forget to replace “/path/to/” by your local path.
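
To confirm that the PATH settings took effect, one option is to test whether the executables can be found. This is a minimal sketch, assuming the tool names used in the commands of this document (TxIndexer, GenTC, XAEM) plus Rscript; the function check_tools is hypothetical and not part of XAEM.

```shell
# Hypothetical helper: report which of the given executables are on PATH.
# Returns 0 if all are found, 1 if at least one is missing.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example:
#   check_tools TxIndexer GenTC XAEM Rscript
```

If a tool is reported as missing, re-check the two export lines and the local path that replaced "/path/to/".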

3. XAEM: step by step instruction and explanation

XAEM mainly contains the following steps:

Preparation for the annotation reference: process the transcript annotation to obtain the information needed for transcript quantification. This step includes 1) indexing the transcript sequences and 2) constructing the design matrix X.
Quantification of transcripts: take multiple RNA-seq samples as input, run quasi-mapping and generate the data for quantifying transcript expression. This step consists of 1) generating the equivalence class table; 2) creating the Y count matrix; and 3) using the AEM algorithm to update the X matrix and estimate the transcript (isoform) expression.

3.1 Preparation for the annotation reference

3.1.1 Indexing transcripts

Use TxIndexer to index the transcript sequences in the reference file (transcripts.fa). For example:

wget https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/XAEM_datasources/transcripts.fa.gz
gunzip transcripts.fa.gz
TxIndexer -t /path/to/transcripts.fa -o /path/to/TxIndexer_idx

3.1.2 Construction of the X matrix (design matrix)

This step constructs the X matrix required by the XAEM pipeline. For users working with human annotation of UCSC hg19 the X matrix can be downloaded here: X_matrix.rdata (need to rename the file to X_matrix.RData).

Given file transcripts.fa containing the transcript sequences of an annotation reference, we construct the design matrix as follows.

a) Generate simulated RNA-seq data using the R-package “polyester”

## R-packages of "polyester" and "Biostrings" are required
Rscript XAEM_home/R/genPolyesterSimulation.R /path/to/transcripts.fa /path/to/design_matrix

b) Run GenTC to generate Transcript Clusters (TC) from the simulated data. GenTC writes an eqClass.txt file, which is the input for the next step.

GenTC -i /path/to/TxIndexer_idx -l IU -1 /path/to/design_matrix/sample_01_1.fasta -2 /path/to/design_matrix/sample_01_2.fasta -p 8 -o /path/to/design_matrix

c) Create a design matrix using buildCRP.R. The parameters of this script are as follows.
- in: the input file (eqClass.txt) obtained from the previous step.
- out: the name of the output file (*.RData) in which the design matrix will be saved.
- H: (default H=0.025) the threshold used to filter false-positive neighbors in each X matrix (please refer to the XAEM paper, Section 2.2.1).

Rscript XAEM_home/R/buildCRP.R in=/path/to/design_matrix/eqClass.txt out=/path/to/design_matrix/X_matrix.RData H=0.025
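
Steps a) to c) share the same paths, so it can be convenient to generate them together. The sketch below only prints the three commands in order (a dry run) and executes nothing; the function name print_xmatrix_commands and its argument order are hypothetical, while the commands themselves are the ones listed above. It assumes paths without spaces.

```shell
# Print, in order, the three commands of Section 3.1.2 for one annotation.
# Nothing is executed here.
# Arguments: <XAEM_home> <transcripts.fa> <TxIndexer_idx> <output dir> [threads]
print_xmatrix_commands() {
    local xaem_home=$1
    local fasta=$2
    local index=$3
    local outdir=$4
    local threads=${5:-8}

    # a) simulate RNA-seq reads with polyester
    echo "Rscript $xaem_home/R/genPolyesterSimulation.R $fasta $outdir"
    # b) build transcript clusters; GenTC writes eqClass.txt to the output dir
    echo "GenTC -i $index -l IU -1 $outdir/sample_01_1.fasta -2 $outdir/sample_01_2.fasta -p $threads -o $outdir"
    # c) build the design matrix with the default threshold H=0.025
    echo "Rscript $xaem_home/R/buildCRP.R in=$outdir/eqClass.txt out=$outdir/X_matrix.RData H=0.025"
}

# Example:
#   print_xmatrix_commands /path/to/XAEM_home transcripts.fa TxIndexer_idx design_matrix 8
```

After reviewing the printed commands, they can be run by piping the output to bash.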

3.2 Quantification of transcripts

Suppose we already created a working directory “XAEM_project” (/path/to/XAEM_project/) for quantification of transcripts.

3.2.1 Generating the equivalence class table

The command to generate the equivalence class table for each sample is similar to “sailfish quant”. For example, to run XAEM for sample1 and sample2 with 4 CPUs:

XAEM -i /path/to/TxIndexer_idx -l IU -1 s1_read1.fasta -2 s1_read2.fasta -p 4 -o /path/to/XAEM_project/sample1
XAEM -i /path/to/TxIndexer_idx -l IU -1 s2_read1.fasta -2 s2_read2.fasta -p 4 -o /path/to/XAEM_project/sample2

If the data are compressed in gz format, we can combine the command with gunzip to decompress on the fly:

XAEM -i /path/to/TxIndexer_idx -l IU -1 <(gunzip -c s1_read1.gz) -2 <(gunzip -c s1_read2.gz) -p 4 -o /path/to/XAEM_project/sample1
XAEM -i /path/to/TxIndexer_idx -l IU -1 <(gunzip -c s2_read1.gz) -2 <(gunzip -c s2_read2.gz) -p 4 -o /path/to/XAEM_project/sample2
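
With many samples it is convenient to generate these commands in a loop. The sketch below only prints the commands (a dry run). It assumes reads named <sample>_read1.fasta.gz and <sample>_read2.fasta.gz inside the project directory, a naming convention chosen for illustration, and the function name print_xaem_commands is hypothetical.

```shell
# Print one XAEM command per sample (dry run; nothing is executed).
# ASSUMPTION: paired, gzip-compressed reads named <sample>_read1.fasta.gz
# and <sample>_read2.fasta.gz are stored in the project directory.
# Arguments: <TxIndexer_idx> <project dir> <threads> <sample>...
print_xaem_commands() {
    local index=$1
    local project=$2
    local threads=$3
    shift 3
    local s
    for s in "$@"; do
        # Inside double quotes "<(" is printed literally, not expanded.
        echo "XAEM -i $index -l IU" \
             "-1 <(gunzip -c $project/${s}_read1.fasta.gz)" \
             "-2 <(gunzip -c $project/${s}_read2.fasta.gz)" \
             "-p $threads -o $project/$s"
    done
}

# Example:
#   print_xaem_commands /path/to/TxIndexer_idx /path/to/XAEM_project 4 sample1 sample2
```

Because the printed commands use process substitution, they need bash (not plain sh) when they are eventually run.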

3.2.2 Creating Y count matrix

After running XAEM, each sample has its own equivalence class table. We then create the Y count matrix. For example, to run this step in parallel using 8 cores, the command is:

Rscript Create_count_matrix.R workdir=/path/to/XAEM_project core=8
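
Before creating the count matrix, it may be worth checking that every sample directory actually contains an equivalence class table. A minimal sketch under one loud assumption: that XAEM writes a file named eqClass.txt into each sample's output directory, as GenTC does in Section 3.1.2. The function name find_missing_eqclass is hypothetical.

```shell
# Hypothetical pre-flight check: list sample directories under the working
# directory that lack an equivalence class table.
# ASSUMPTION: each XAEM output directory contains a file named eqClass.txt.
# Returns 0 if every sub-directory has one, 1 otherwise.
find_missing_eqclass() {
    local workdir=$1
    local d
    local status=0
    for d in "$workdir"/*/; do
        [ -d "$d" ] || continue        # the glob matched nothing
        if [ ! -s "${d}eqClass.txt" ]; then
            echo "no eqClass.txt in ${d%/}"
            status=1
        fi
    done
    return "$status"
}

# Example:
#   find_missing_eqclass /path/to/XAEM_project
```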

3.2.3 Updating the X matrix and transcript expression using AEM algorithm

Once the Y count matrix has been constructed, we use the AEM algorithm to update the X matrix. The updated X matrix is then used to estimate the transcript (isoform) expression. The command is as follows.

Rscript AEM_update_X_beta.R workdir=/path/to/XAEM_project core=8 design.matrix=X_matrix.RData isoform.out=XAEM_isoform_expression.RData paralog.out=XAEM_paralog_expression.RData merge.paralogs=FALSE isoform.method=average remove.ycount=TRUE

Parameter setting

workdir: the path to working directory
core: the number of cpu cores for parallel computing
design.matrix: the path to the design matrix
isoform.out (default=XAEM_isoform_expression.RData): the output contains the estimated expression of individual transcripts, where the paralogs are split into separate isoforms. This file contains two objects: isoform_count and isoform_tpm for estimated counts and normalized values (TPM). The expression of the individual isoforms is calculated with the corresponding setting of parameter “isoform.method” below.
isoform.method (default=average): to report the expression of the individual members of a paralog as the average or total expression of the paralog set (value=average/total).
paralog.out (default=XAEM_paralog_expression.RData): the output contains the estimated expression of merged paralogs. This file consists of two objects: XAEM_count and XAEM_tpm for the estimated counts and normalized values (TPM). The standard error of the estimate is supplied in object XAEM_se stored in *.standard_error.RData.
merge.paralogs (default=TRUE) (*): the parameter to turn on/off (value=TRUE/FALSE) the paralog merging in XAEM. Please see the details of how to use this parameter in the note at the end of this section.
remove.ycount (default=TRUE): to clean all data of Ycount after use.

The output of this step is saved in XAEM_isoform_expression.RData, which contains the TPM values and raw read counts of the samples.

Note: (*) In the XAEM pipeline we provide this parameter (merge.paralogs) to merge or not merge the paralogs within the updated X matrix (please see the XAEM paper, Section 2.2.3 and Section 2.3). Turning on the paralog merging step (the default) produces a more accurate estimation. Turning it off produces the same set of isoforms across different projects.

4. A practical copy-paste example of running XAEM

This section presents a tutorial for running the XAEM pipeline with a toy example. Suppose that the input data contain two RNA-seq samples and the server supplies 4 CPUs for computation. XAEM can be tested by simply copying and pasting the example commands.

Download the binary version of XAEM and do configuration

# Create a working folder
mkdir XAEM_example
cd XAEM_example
# Download the binary version of XAEM
wget https://github.com/WenjiangDeng/XAEM/releases/download/v0.1.2/XAEM-binary-0.1.2.tar.gz

# Configure the tool
tar -xzvf XAEM-binary-0.1.2.tar.gz
cd XAEM-binary-0.1.2
bash configure.sh

# Add the paths to system
export LD_LIBRARY_PATH=$PWD/lib:$LD_LIBRARY_PATH
export PATH=$PWD/bin:$PATH
cd ..

Download annotation files and index the transcripts

## download annotation files
# Download the design matrix for the human UCSC hg19 annotation 
wget -O X_matrix.RData https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/2022/09/X_matrix.rdata --no-check-certificate

# Download the fasta of transcripts in the human UCSC hg19 annotation 
wget https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/XAEM_datasources/transcripts.fa.gz
gunzip transcripts.fa.gz

## Run XAEM indexer
TxIndexer -t transcripts.fa -o TxIndexer_idx

If using GRCh38.95, download the corresponding annotation files (Homo_sapiens.GRCh38.95.cdna.all.fa and Homo_sapiens.GRCh38.95.gtf) from Ensembl (http://jan2019.archive.ensembl.org/Homo_sapiens/Info/Index) and the X-matrix of GRCh38.95 from here: https://www.dropbox.com/s/x6a693v1y7must0/X_matrix.RData

Download the RNA-seq data of two samples: sample1 and sample2

## Download input RNA-seq samples
# Create a XAEM project to save the data
mkdir XAEM_project
cd XAEM_project

# Download the RNA-seq data
wget https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/XAEM_datasources/sample1_read1.fasta.gz
wget https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/XAEM_datasources/sample1_read2.fasta.gz
wget https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/XAEM_datasources/sample2_read1.fasta.gz
wget https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/XAEM_datasources/sample2_read2.fasta.gz
cd ..

Generate the equivalence class tables for these samples

# Number of CPUs
CPUNUM=4

# Process for sample 1
XAEM -i TxIndexer_idx -l IU -1 <(gunzip -c XAEM_project/sample1_read1.fasta.gz) -2 <(gunzip -c XAEM_project/sample1_read2.fasta.gz) -p $CPUNUM -o XAEM_project/sample1

# Process for sample 2
XAEM -i TxIndexer_idx -l IU -1 <(gunzip -c XAEM_project/sample2_read1.fasta.gz) -2 <(gunzip -c XAEM_project/sample2_read2.fasta.gz) -p $CPUNUM -o XAEM_project/sample2

Create Y count matrix

# Note: R packages "foreach" and "doParallel" are required for parallel computing
Rscript $PWD/XAEM-binary-0.1.2/R/Create_count_matrix.R workdir=$PWD/XAEM_project core=$CPUNUM design.matrix=$PWD/X_matrix.RData

Estimate isoform expression using AEM algorithm

Rscript $PWD/XAEM-binary-0.1.2/R/AEM_update_X_beta.R workdir=$PWD/XAEM_project core=$CPUNUM design.matrix=$PWD/X_matrix.RData isoform.out=XAEM_isoform_expression.RData paralog.out=XAEM_paralog_expression.RData

The outputs are stored in the folder of “XAEM_project” including XAEM_isoform_expression.RData and XAEM_paralog_expression.RData.

5. Dataset for differential expression (DE) analysis

In the XAEM paper we used RNA-seq data from the breast cancer cell line MDA-MB-231 for DE analysis. Since the original data were generated by our collaborators and have not been published yet, we provide the equivalence class tables obtained by running the read-alignment tool RapMap, which is the same mapper used by Salmon and is completely independent of the XAEM algorithm. We also provide the R scripts and a guide to replicate the DE analysis results in the paper.

This section explains how to download the data and run the scripts. We have tried to keep the pipeline copy-paste friendly in the shell, but the R scripts must be run in the R console.

5.1 Download the R-scripts and the design matrix

This step is to download the R-scripts, change directory to the folder containing the R-scripts and download the design matrix.

# Download R-scripts
wget https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/XAEM_datasources/brca_singlecell/RDR_brca_singlecell.zip
unzip RDR_brca_singlecell.zip
cd RDR_brca_singlecell

# Download the design matrix
wget -O X_matrix.RData https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/2022/09/X_matrix.rdata --no-check-certificate

5.2 Run XAEM from the equivalence class tables produced by the read-alignment tool RapMap

Download the data of equivalence classes

# Download the equivalence class tables of the single cells, produced by the read-alignment tool RapMap

wget https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/XAEM_datasources/brca_singlecell/brca_singlecell_eqclassDir.zip
unzip brca_singlecell_eqclassDir.zip

Run XAEM on the equivalence class tables using the R code below. Note: this step takes about 2 hours on a personal computer with 4 CPUs. Users can skip this step and download the available XAEM results for the downstream analysis instead.

# set the project path
projPath=getwd();
setwd(projPath)
source("collectDataOfXAEM.R")

If users want to download the available XAEM results

# Download the available results of XAEM

wget https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/XAEM_datasources/brca_singlecell/XAEM_results.zip
unzip XAEM_results.zip

5.3 Differential-expression analysis of XAEM and other methods

Download the Cufflinks and Salmon data. These files contain the read counts from each method, with and without bias correction.

# Download the results of Cufflinks
wget https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/XAEM_datasources/brca_singlecell/cufflinks_results.zip
unzip cufflinks_results.zip

# Download the results of Salmon
wget https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/XAEM_datasources/brca_singlecell/salmon_results.zip
unzip salmon_results.zip

Run the code below in R to do the normalization and the differential expression analysis.

# set the project path
projPath=getwd();
setwd(projPath)

# Normalize the data of three methods XAEM, Salmon and Cufflinks
source("Isoform_Expression_CPM_Normalization.R")

# Do DE analysis and plot figures
source("DEanalysis_plots.R")

# output: DE_Analysis.png

The differential expression analysis produces the plots in DE_Analysis.png (Figure 1 below), which reproduce Figure 3 of the XAEM paper. Because each of the 50 runs is randomized, the figure might differ slightly from the one in the paper.

Figure 1. Detection and validation of differentially expressed (DE) isoforms using the MDA-MB-231 scRNA-seq dataset. XAEM, Salmon and Cufflinks are presented in blue-solid, red-dashed and grey-dotted lines, respectively. The x-axis shows the number of top DE isoforms in the training set; the y-axis is the proportion of rediscovery in the validation set. The rediscovery rate (RDR) is calculated by comparing the top 100, 500 and 1000 DE isoforms from the training set with all the significant DE isoforms from the validation set. The boxplots show the RDR from 50 runs. (a) Both training set and validation set are constructed using cells from batch 1. The quantification of XAEM, Salmon and Cufflinks is performed without bias correction. (b) The quantification from the three methods is bias-corrected. (c) The training set is constructed using cells from batch 1, while the validation set uses cells from batch 2. The RDR is calculated for only singleton isoforms. (d) The training set is constructed using cells from batch 1, and the validation set using cells from batch 2. The RDR is calculated using only non-paralogs.

References:

Deng, Wenjiang, Tian Mou, Nifang Niu, Liewei Wang, Yudi Pawitan, and Trung Nghia Vu. 2019. “Alternating EM Algorithm for a Bilinear Model in Isoform Quantification from RNA-Seq Data.” Bioinformatics. https://doi.org/10.1093/bioinformatics/btz640.

Driver Genes

Computing driver-gene score using integrated genomics and transcriptomics profiles of tumor and matched-normal tissues

Content

Introduction
Flowchart
Example usage
References

1. Introduction

All cancers arise as a result of somatically acquired changes in the DNA of cancer cells, yet not all somatic abnormalities found in a cancer genome are involved in tumor development (carcinogenesis). Passenger mutations occur by genetic hitchhiking in an unstable environment and they have no effect on the fitness of a clone or the tumor progression. Distinguishing driver mutations actively involved in carcinogenesis from passenger mutations is a key step to understand the mechanism of tumor emergence and evolution, and to determine potential therapeutic targets.

With the progress of sequencing methods, full profiles of tumor genomes and exomes can now be obtained, which enables an unbiased search for driver mutations. Methods based on functional and/or frequency approaches have been proposed, e.g. DOTS-Finder, MuSiC and OncodriveFM.

DNA and RNA sequencing of tumor and matched-normal tissues from the same patient allows accurate detection of somatic mutations and differential expression at isoform-level. To integrate mutation, expression and functional data from the different omics data, we are developing a pipeline to calculate driver-gene score (DGscore). To contribute to the score, a gene has to satisfy four key properties:

be frequently mutated,
carry mutations with an impact on a protein-coding gene,
exhibit a large tumour-normal differential expression, and
be functionally linked to many differentially expressed neighbours in a functional gene network, such as a protein-protein interaction network.

2. Flowchart

We have developed a computational approach to calculate driver-gene scores using existing bioinformatics tools, tools previously developed by our group, and a set of in-house bash, Python and R scripts. For learning purposes the approach can be viewed as five steps, shown in the gray boxes in Figure 1. Briefly, these are: 1) estimating gene and isoform expression using a joint statistical model that accounts for non-uniform isoform-specific read distribution, as implemented in our Sequgio pipeline; 2) preprocessing BAM files; 3) assessing the statistical significance of the association between allelic counts and tumor/normal status for somatic variant calling; 4) predicting the impact of somatic mutations on protein-coding genes; 5) integrating gene and isoform expression and predicted somatic mutations to derive driver-gene z-scores using Network Enrichment Analysis.

Below, we explain each step in more detail, providing usage examples and links to fuller descriptions. Where applicable, we also provide ready-to-download scripts to speed up learning and setup. We recommend running all calculations on a high-performance computing cluster, e.g. UPPMAX for users in Sweden. Although our approach is not system dependent, most of the scripts have been developed and tested on UPPMAX, so the description below assumes access to a UNIX environment. To run our pipeline effectively, various bioinformatics approaches and tools have to be understood. Unless one is already familiar with GATK, snpEff and our previously developed methods, Sequgio and Network Enrichment Analysis, setting up the full run may take several days. As the approach is built upon many tools that we feel are best put together by the user, we focus on providing the essential know-how for using our approach and assume basic bioinformatics knowledge.

The complete run for one patient, with 6GB tumor and matched-normal .BAM files (each) from WXS sequencing (WXS-seq) and 5GB tumor and matched-normal .BAM files (each) from RNA sequencing (RNA-seq) can be completed in 24h on UPPMAX using 4 cores.

Figure 1. Flowchart of computational identification of the driver genes based on the tumor and matched-normal RNA-seq and Exome-seq tissue data; AGS, altered gene sets; FGS, functional gene set. Step 1: estimating gene- and isoform expression using joint statistical model accounting for non-uniform isoform-specific read distribution as implemented in our Sequgio pipeline; Step 2: preprocessing BAM files; Step 3: assessing statistical significance of association between allelic counts and tumor/normal status for somatic variant calling; Step 4: predicting impact of somatic mutations on protein coding genes; Step 5: integrating gene- and isoform expressions and predicted somatic mutations to derive driver genes z-scores using Network Enrichment Analysis.

3. Example usage

To use the Driver Genes approach, you need 4 BAM files per patient: tumor and matched-normal WXS-seq data, and tumor and matched-normal RNA-seq data. These files should at least be sorted and indexed, but we also recommend marking duplicates and removing reads with a quality score of 0. An example script showing the preparation of the BAM files can be found in our SOMAC wiki post (More on BAM processing).
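
Since every later step expects indexed BAM files, a small pre-flight check can save a failed run. This is a minimal sketch, assuming the index sits next to the BAM file as either <name>.bam.bai or <name>.bai; the function name check_bam_indexed and the file names in the usage line are hypothetical.

```shell
# Hypothetical helper: verify that every given BAM file exists and has an
# index next to it, either <name>.bam.bai or <name>.bai.
# Returns 0 if all files pass, 1 otherwise.
check_bam_indexed() {
    local bam
    local status=0
    for bam in "$@"; do
        if [ ! -s "$bam" ]; then
            echo "missing BAM: $bam"
            status=1
        elif [ ! -e "$bam.bai" ] && [ ! -e "${bam%.bam}.bai" ]; then
            echo "missing index: $bam"
            status=1
        fi
    done
    return "$status"
}

# Example (four hypothetical files for one patient):
#   check_bam_indexed tumor_wxs.bam normal_wxs.bam tumor_rna.bam normal_rna.bam
```

The check says nothing about sort order or duplicate marking; it only tests that the files are present.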

Step 1: Estimating gene- and isoform expression (Sequgio)

Sequgio uses a joint statistical model that accounts for non-uniform isoform-specific read distribution when estimating gene isoform expression. Sequgio is implemented as an R package, available from GitHub and soon to be available as part of Bioconductor. Example usage and data were released with the Sequgio publication [1] and are described in detail in another post of our Wiki. Briefly, using RNA-seq data (both tumor and normal for each patient) we 1) create the annotation template, 2) create the design matrix, 3) import the BAM files and create a count matrix and 4) fit the model to obtain gene-isoform expression estimates.

Here you can download an R script containing the example usage and data as released with the Sequgio publication. The script contains explanatory comments.

Download sequgio.R script: sequgio.R

Step 2: Preprocessing BAM files (GATK)

Variant calling is not error free; major sources of error include erroneous realignment in low-complexity regions and a reference genome that is incomplete with respect to the sample. It is therefore good practice to include post-alignment BAM processing steps, including realignment and base quality recalibration, especially when using variant callers that do not perform local realignment.

We used GATK for post-alignment realignment and base quality recalibration, and we refer you to the well-documented GATK website for details. An example bash script is given below.

This step can be also performed using our SOMAC v01 pipeline, as described in Step 3: Statistical significance of association between allelic counts and tumor/normal status for somatic variant calling (SOMAC).

#!/bin/bash

# GATK realignment and base quality recalibration: example.sh script
# To be run from the directory where example.bam and example.bam.bai are located

# Specify the paths to GATK, ref.fasta and indels.vcf
path2GATK="/sw/apps/bioinfo"
refFasta="/proj/b2014296/DATA/FILES/Homo_sapiens_assembly19.fasta"
refKnown="/proj/b2014296/DATA/FILES/Mills_and_1000G_gold_standard.indels.b37.vcf"

# GATK (3.3.0) RealignerTargetCreator: find intervals that need realignment
java -Xmx2g -jar "$path2GATK/GATK/3.3.0/GenomeAnalysisTK.jar" -T RealignerTargetCreator -R "$refFasta" -I example.bam -o forIndelRealigner.intervals -known "$refKnown"

# GATK (3.3.0) IndelRealigner: realign reads in those intervals
java -Xmx2g -jar "$path2GATK/GATK/3.3.0/GenomeAnalysisTK.jar" -T IndelRealigner -R "$refFasta" -I example.bam -targetIntervals forIndelRealigner.intervals -o realigned.bam

# GATK (3.3.0) BaseRecalibrator: build the recalibration table
java -Xmx2g -jar "$path2GATK/GATK/3.3.0/GenomeAnalysisTK.jar" -T BaseRecalibrator -R "$refFasta" -I realigned.bam -knownSites "$refKnown" -o recal.grp

# GATK (3.3.0) PrintReads: write the recalibrated BAM
java -Xmx2g -jar "$path2GATK/GATK/3.3.0/GenomeAnalysisTK.jar" -T PrintReads -R "$refFasta" -I realigned.bam -BQSR recal.grp -o recalibrated.bam

Download example post alignment BAM processing .sh script here: GATK_example.sh

Step 3: Statistical significance of association between allelic counts and tumor/normal status for somatic variant calling (SOMAC)

The next step after processing the whole-exome sequencing data is to call somatic variants using the paired tumor and normal samples. To this end we have developed the SOMAC pipeline, which is based on standard parametric statistical tests (chi-square) and a novel multidimensional false discovery rate (FDR) estimation (Ploner et al. 2006). The pipeline is described in detail here. As part of this pipeline it is possible to automatically run the BAM post-alignment processing (realignment and base quality recalibration, as described in Step 2).

# Running SOMAC
> bash path/to/somac config.TN inputFiles.txt partition.txt

The SOMAC output is a table of potential somatic variants including chi-square, fdr2d and LRT statistics (in .txt and .RData) and a .vcf file containing the potential somatic variants. These can be further filtered using the desired statistical criteria, e.g. chi-square p-value < 0.001.
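
Such a threshold can be applied to the .txt table from the shell. This is a minimal sketch, assuming the table is tab-separated with a header row; the column name in the usage line (chisq_p) and the file name are hypothetical, since the actual SOMAC column names are not listed here, and filter_below is not part of SOMAC.

```shell
# Hypothetical filter for a tab-separated table with a header row: keep the
# header and every row whose value in column <name> is below <cutoff>.
# ASSUMPTION: the caller supplies the real column name; it is looked up in
# the header, and the function fails if the column does not exist.
filter_below() {
    local table=$1
    local name=$2
    local cutoff=$3
    awk -F'\t' -v name="$name" -v cutoff="$cutoff" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == name) col = i
            if (!col) {
                print "column not found: " name > "/dev/stderr"
                exit 1
            }
            print
            next
        }
        # "+ 0" forces a numeric comparison
        ($col + 0) < (cutoff + 0) { print }
    ' "$table"
}

# Example (hypothetical file and column names):
#   filter_below somac_output.txt chisq_p 0.001 > somac_filtered.txt
```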

Download SOMAC scripts: Somac_v01

Step 4: Predicting impact of somatic mutations on protein coding genes (snpEff)

To identify the driver genes, we focus on significant high- and moderate-impact mutations, that is, mutations leading for instance to exon deletions, frame shifts, or codon insertions and deletions. To identify these we used snpEff, a variant annotation and effect prediction tool (Cingolani et al. 2012). For example, when searching for driver genes predictive of breast cancer survival based on 60 paired tumor and normal TCGA samples, significant isoforms (top 1000 ranked by chi-square, with chi-square p-value < 0.001) that were mutated in at least ten patients and predicted to have high or moderate impact in more than two patients were kept on the list of potential driver genes.

# Running snpEff
> java -Xmx4g -jar /path/to/snpEff.jar GRCh37.75 input.vcf > output.vcf

Download snpEff here & more on snpEff can be found here
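
Once the VCF is annotated, the high- and moderate-impact variants can be pulled out with a one-line filter. This is a minimal sketch, assuming snpEff wrote its impact labels as the literal words HIGH and MODERATE into the INFO column (field 8) of output.vcf; the function name keep_high_moderate is hypothetical, and the pattern is a plain text match rather than a parse of the annotation field.

```shell
# Hypothetical filter: keep VCF header lines and the records whose INFO
# column (field 8) mentions a HIGH or MODERATE impact.
# ASSUMPTION: the impact labels appear as the words HIGH / MODERATE in INFO;
# check this against your own annotated VCF before relying on it.
keep_high_moderate() {
    awk -F'\t' '/^#/ || $8 ~ /HIGH|MODERATE/' "$1"
}

# Example:
#   keep_high_moderate output.vcf > output.high_moderate.vcf
```

Because this is a text match, a variant with several annotations is kept as soon as any one of them is HIGH or MODERATE.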

Step 5: Integrating gene- and isoform expressions and predicted somatic mutations to derive driver genes z-scores (Network Enrichment Analysis)

Network Enrichment Analysis (NEA) extends the overlap statistics of gene-set enrichment analysis to network links between genes in the experimental set and those in the functional categories (Alexeyenko et al. 2012). We use it to uncover potential driver genes that have strong network connectivity with an altered gene set (AGS). Each mutated gene is referred to as a functional gene set (FGS), and the AGS are derived from the differentially expressed genes and isoforms (Sequgio). A quantitative enrichment z-score is calculated to assess the statistical significance of over-represented direct links between AGS and FGS; the z-score allows mutated genes to be prioritised (Alexeyenko et al. 2012).

To identify common-driver genes we set AGS as genes with differentially expressed isoform levels between tumor and normal.

To identify patient-specific driver genes we set AGS as genes with large tumor/normal expression ratio within each patient.

# Running NEA in R
> require(nea)
> res.MNEA <- nea(ags=AGS, fgs=FGS, fgslib=NULL, network=mnet, pnet=NET, nperm=50)

where: 
ags: a vector of altered genes or isoforms, e.g. obtained from Sequgio
fgs: a list of functional gene sets, e.g. obtained from mutation effect prediction
fgslib: a character of the name of annotation data
network: a vector of gene pairs representing the network link
pnet: a list of randomly permuted networks
nperm: number of permutations

# For more detailed usage
> ?nea

Download NEA here and follow fully worked-out example in Section C of Additional file 1 (Alexeyenko et al. 2012)

Questions?

yudi.pawitan@ki.se

References

Alexeyenko, Andrey, Woojoo Lee, Maria Pernemalm, Justin Guegan, Philippe Dessen, Vladimir Lazar, Janne Lehtiö, and Yudi Pawitan. 2012. “Network Enrichment Analysis: Extension of Gene-Set Enrichment Analysis to Gene Networks.” BMC Bioinformatics 13: 226.

Cingolani, Pablo, Adrian Platts, Le Lily Wang, Melissa Coon, Tung Nguyen, Luan Wang, Susan J Land, Xiangyi Lu, and Douglas M Ruden. 2012. “A Program for Annotating and Predicting the Effects of Single Nucleotide Polymorphisms, SnpEff: SNPs in the Genome of Drosophila Melanogaster Strain w1118; Iso-2; Iso-3.” Fly 6 (2): 80–92.

Ploner, Alexander, Stefano Calza, Arief Gusnanto, and Yudi Pawitan. 2006. “Multidimensional Local False Discovery Rate for Microarray Studies.” Bioinformatics (Oxford, England) 22 (5): 556–65.

Suo, Chen, Stefano Calza, Agus Salim, and Yudi Pawitan. 2013. “Joint Estimation of Isoform Expression and Isoform-Specific Read Distribution Using Multi-Sample RNA-Seq Data,” 1–8.

TCGA data

TCGA

This post collects all the information regarding accessing data from TCGA Database.

Data Manifest

The manifest stores summary information on the data available in TCGA. The tab-separated file can be downloaded as follows:

wget https://cghub.ucsc.edu/reports/SUMMARY_STATS/LATEST_MANIFEST.tsv

Sequgio

An Introduction to Sequgio

Contents

1 Introduction
2 Installing Sequgio
3 Example data
3.1 Step 1: creating the annotation template (txdb)
3.2 Step 2: creating the design matrix object
3.3 Step 3: importing BAM files to create counts matrix
3.4 Step 4: fitting models
4 Other comments
4.1 Fixing QNAME
4.2 Alternative counting method

1 Introduction

This document provides a brief guide to Sequgio, a package for gene isoform expression estimation that uses a joint statistical model accounting for non-uniform isoform-specific read distribution (Suo et al. 2014). There are four components to this package:

constructing annotation template txdb
constructing design matrices used in expression estimation
importing BAM files to create counts matrix
estimating the expression levels, and optionally the read distribution and standard error of the expression estimates

Read intensity in RNA-sequencing data is often not uniform, in which case the current standard methods for gene and isoform expression estimation give biased estimates. Sequgio accounts for non-uniform isoform-specific read distribution, yielding improved gene isoform expression estimates. A statistical regularization with an L₁ smoothing penalty is imposed to control the estimation. Also, for estimability reasons, the method uses information across samples from the same gene (Suo et al. 2014).

2 Installing Sequgio

Sequgio is implemented as an R package and will soon be available at bioconductor.org. In the meantime, please use our development version on GitHub.

To obtain the Sequgio package from GitHub (available as of 2015/01/20), execute the commands below from R:

install.packages("devtools")
library(devtools)
dev_mode(on=TRUE)
# Install the "Stable" branch of the Senbee/Sequgio repository
install_github("Senbee/Sequgio", ref = "Stable")
dev_mode(on=FALSE)

3 Example data

We demonstrate the functionality of the Sequgio package using the RNA sequencing samples provided in the RNAseqData.HNRNPC.bam.chr14 package. For demonstration purposes, Sequgio provides an annotation template object (txdb, obtained in Step 1) and a design matrix (Design, obtained in Step 2).

> library(Sequgio)
> library(RNAseqData.HNRNPC.bam.chr14)
> library("TxDb.Hsapiens.UCSC.hg19.knownGene")
> data("TxDb")
> data("Design")

For better performance, the package supports parallel computing via the BiocParallel package, which is loaded automatically. For parallel processing, set the parameters to suit your platform. We will use sequential computation here.

> param = SerialParam()

3.1 Step 1: creating annotation template (txdb)

The first step is to provide a TranscriptDb object that stores transcript annotation information. Here, we will use the one provided by the package TxDb.Hsapiens.UCSC.hg19.knownGene.

The TranscriptDb object is then preprocessed to generate disjoint regions using the reshapeTxDb function. To obtain accurate results, set the read length parameter to match your experiment (72 nucleotides for the example data).

Further, since the example data contain only chromosome 14 we limit TranscriptDb to this chromosome as well (reducing computational burden).

# Example data
> seqs = seqnames(seqinfo(TxDb.Hsapiens.UCSC.hg19.knownGene))
> sel =  rep(FALSE,length(seqs))
> names(sel) =  seqs
> sel["chr14"] = TRUE
> isActiveSeq(TxDb.Hsapiens.UCSC.hg19.knownGene) =  sel
> txdb = reshapeTxDb(TxDb.Hsapiens.UCSC.hg19.knownGene,probelen = 72L, mcpar=param)

# For real data, obtaining txdb could look like this
# (param_core is assumed to be a user-defined BiocParallel parameter object)
> txdb.ucsc = makeTranscriptDbFromGFF(file="pathway/to/genes.gtf", format="gtf", dataSource="http://cufflinks.cbcd.umd.edu/igenomes.html", species="hg19.2014-06-02-13-47-56")
> seqs = seqnames(seqinfo(txdb.ucsc))
> sel = rep(TRUE, length(seqs))
> names(sel) = seqs
> isActiveSeq(txdb.ucsc)=sel
> txdb = reshapeTxDb(txdb.ucsc, probelen = 48L, mcpar=param_core)

This step has to be repeated only if the experimental parameters change, e.g. annotation database, read length, with/without junctions, etc.
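To illustrate what "disjoint regions" means in this step, here is a minimal Python sketch that splits overlapping exon intervals into non-overlapping pieces. It is not the reshapeTxDb implementation: it ignores strand, read length and junctions, and the coordinates are made up for the example.

```python
def disjoin(intervals):
    """Split possibly overlapping intervals into disjoint pieces.

    Each interval is (start, end) with inclusive coordinates. Every start,
    and every position just after an end, is a breakpoint; pieces between
    consecutive breakpoints are kept only if some input interval covers them.
    """
    breakpoints = sorted({s for s, _ in intervals} | {e + 1 for _, e in intervals})
    pieces = []
    for left, right in zip(breakpoints, breakpoints[1:]):
        piece = (left, right - 1)
        if any(s <= piece[0] and piece[1] <= e for s, e in intervals):
            pieces.append(piece)
    return pieces


# Hypothetical exons: the first two overlap, the third stands alone.
exons = [(100, 200), (150, 300), (400, 500)]
regions = disjoin(exons)
print(regions)
```

Each resulting piece is covered by a fixed set of exons, so a read falling in a piece is compatible with a well-defined set of transcripts.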

If the user wants to focus on a subset of the genes/features in the genome, this can be done using the argument include.only, which takes a GRanges object as input. The purpose is to limit resource usage when reading BAM files: input will be restricted to the regions supplied.



# Let's consider only two genes for example

> GENES = genes(TxDb.Hsapiens.UCSC.hg19.knownGene)[c(1,3)]
> txdb = reshapeTxDb(TxDb.Hsapiens.UCSC.hg19.knownGene,probelen = 72L, 
            include.only=GENES, mcpar=param)

The ranges can also be supplied later in the pipeline (see setCounts).

3.2 Step 2: creating design matrix

The second step is to create design matrices for each “transcriptional unit” (see references). These matrices will be used in the fitting procedure.

Several parameters can be tuned. The main one is the library type used in the experiment: paired-end (“PE”) or single-end (“SE”, the default). This is set with the method argument. The required mulen argument provides an estimate of the average fragment length.

Here we will use paired-end data.

> Design <- makeXmatrix(txdb, method="PE", mulen=155, sd=50, mcpar=param)

As with step 1, this step has to be performed only once.

3.3 Step 3: importing BAM files to create counts matrix

Models are fit based on a matrix of read counts for every region in every sample. We will now import the aligned read counts from the BAM files into R as an object called ’allCounts’. To do so, we create a target object storing the file names (with full path) and the sample names to be used as count matrix column headings. If BAI files are available, they can be provided in the target object.

The resulting object (allCounts) will contain, for every exon, the number of overlapping reads.

Let’s first create the BamFileList object pointing to the BAM files. The data are paired-end, so we set asMates to TRUE.

> bflist <- BamFileList(RNAseqData.HNRNPC.bam.chr14_BAMFILES, asMates=TRUE)

Then we create a seqCounts object that stores information used in counting. A file name for the file-backed matrix can be provided; otherwise a random one is generated. The root directory where the files are stored can also be specified (it defaults to the working directory).

If the backing files (*.bin and *.desc) are later moved elsewhere, the file location information in the seqCounts object will have to be updated.

By default the function setCounts saves the created object in the working directory, with the same fileName as the *.bin and *.desc files. If this file is lost, it can be recreated, which allows linking to the *.bin file again. To do so, specify setCounts(…,existDesc="fileName.desc"). A check is performed on the number of exons and samples, but not on the sample and exon names.

> seqObj <- setCounts(bflist,txdb,fileName="test")

To reduce I/O resource usage when reading BAM files, reading can be restricted to specific regions of interest by specifying a set of ranges with includeOnly:


 #This would show if any subset is already set
> includeOnly(seqObj)
 # Here we set it (see reshapeTxDb for GRanges definition example) 
> includeOnly(seqObj) = GENES
 # Or if we would like to remove it 
> includeOnly(seqObj) = NULL

We then perform the actual counting with the function doCounts (no return value). This function saves the counts in the shared/filebacked big.matrix object.

> doCounts(seqObj,mcpar=param)

Finally, we can import the counts into RAM, creating a standard R matrix.

> allCounts <- getCounts(seqObj)

If the same big.matrix is reused for further counting, its counts should first be reset to zero. To do so, use the resetCounts function:

> resetCounts(seqObj)

Again, for ease of computation, the counts object is already provided.

> data("Counts")

We can see how many reads we counted:

> colSums(allCounts)

The counting procedure can also be performed on a single BAM file or on a subset of a BamFileList.

For a single file we can proceed as follows:

> bfl <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1],asMates=TRUE)
> seqObj1 <- setCounts(bfl,txdb,fileName='test')

and then perform the following steps as described.

For a subset of the samples in a seqCounts object we can proceed as in the following code, using column indices (integers)

> doCounts(seqObj,mcpar=param,which.sample=c(1L,4L))

or similarly using sample names

> doCounts(seqObj,mcpar=param,which.sample=c('ERR127306','ERR127309'))

This last procedure allows different nodes of an HPC platform to count subsets of the samples in parallel (each node gets a different which.sample vector), while each node can use multiple cores to count individual samples. All the counts are stored in the same shared matrix.
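As a rough illustration of how the samples might be divided among nodes, the Python sketch below assigns sample names round-robin and prints the which.sample vector each node would pass to doCounts. The helper names are hypothetical, and ERR127307 and ERR127308 are assumed names for samples 2 and 3 of the example data.

```python
def split_samples(sample_names, n_nodes):
    """Assign samples to nodes round-robin; returns one list per node."""
    return [sample_names[i::n_nodes] for i in range(n_nodes)]


def which_sample_arg(names):
    """Format a list of sample names as an R character vector."""
    return "c(%s)" % ",".join("'%s'" % n for n in names)


# Sample names: the first and fourth appear in the text, the others are assumed.
samples = ["ERR127306", "ERR127307", "ERR127308", "ERR127309"]
per_node = split_samples(samples, 2)
for node, names in enumerate(per_node, start=1):
    print("node %d: which.sample=%s" % (node, which_sample_arg(names)))
```

Round-robin assignment keeps the number of samples per node balanced to within one sample.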

It is also possible to subset a single sample (or many samples) based on genomic position. This can be achieved using the filterBam function in the Rsamtools package.

For example, to read in the whole of chr22 (in a human genome) we could proceed as follows:

bfl <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1],asMates=TRUE,
               yieldSize=100000)
seqObj <- setCounts(bfl,txdb,fileName="test")
# Length of chr22, used to build a range covering the whole chromosome
lenChr22 = seqlengths(txdb)['chr22']
bam.params = ScanBamParam(simpleCigar = FALSE, reverseComplement = FALSE, 
                what = c("qname", "qwidth", "mapq"), 
                flag = scanBamFlag(isPaired = TRUE, isUnmappedQuery = FALSE, 
                isDuplicate = FALSE,isNotPassingQualityControls = FALSE, 
                hasUnmappedMate = FALSE),
                which = RangesList(chr22=IRanges(1,lenChr22)))

# Write the reads passing the filter to a temporary BAM file
tmpFile = filterBam(bfl,tempfile(),param=bam.params)
doCounts(seqObj,mcpar=param)

3.3.1 Fixing QNAME

Some preprocessing pipelines deliver BAM files for paired-end reads where the two mates have slightly different QNAMEs. E.g. when the SRA toolkit is used to convert to fastq (fastq-dump), it generates two fastq files, and the QNAME in each file is appended with “.1” for the first mates and “.2” for the second mates. Similarly, mates may have different prefixes.

The matching procedure implemented in Rsamtools requires mates to have the same QNAME, so a trimming is required. This can be achieved when declaring the BamFile or BamFileList object using the arguments ’qnamePrefixEnd’ or ’qnameSuffixStart’.

Unique qnames aren’t a problem in this sample file; we use it only to demonstrate the usage.

First, no trimming:

> fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
> param <- ScanBamParam(what="qname") 
> bf <- BamFile(fl, asMates=TRUE)
> scanBam(bf, param=param)[[1]]$qname[1:3]

then trim prefix

> qnamePrefixEnd(bf) = "_"
> scanBam(bf, param=param)[[1]]$qname[1:3]

and now trim suffix also

> qnameSuffixStart(bf) = ":"
> scanBam(bf, param=param)[[1]]$qname[1:3]

In real situations the easiest approach is to specify the trimming character directly in the call to BamFile or BamFileList.

For example in TCGA data (e.g. BRCA and LUNG) we have the following QNAME pattern for a pair of reads:

UNC14-SN744_125:6:2108:11845:164113/1

UNC14-SN744_125:6:2108:11845:164113/2

so we need to trim everything from “/” onwards (inclusive).

We would then define the BamFile object (or list, if more than one) as follows:

> bf <- BamFile(Filename,asMates=TRUE,qnameSuffixStart="/") ## single file
> bfl <- BamFileList(Filenames,asMates=TRUE,qnameSuffixStart="/") ## more than one BAM
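The effect of the trimming on read names can be sketched in plain Python. This is an illustration only, not the Rsamtools implementation: the function name is hypothetical, and it assumes that the first occurrence of the prefix character and the last occurrence of the suffix character are used, which may differ from how Rsamtools handles repeated characters.

```python
def trim_qname(qname, prefix_end=None, suffix_start=None):
    """Strip a prefix and/or suffix from a read name.

    prefix_end: everything up to and including this character is removed.
    suffix_start: this character and everything after it is removed.
    Assumption: first occurrence for the prefix, last for the suffix.
    """
    if prefix_end is not None and prefix_end in qname:
        qname = qname.split(prefix_end, 1)[1]
    if suffix_start is not None and suffix_start in qname:
        qname = qname.rsplit(suffix_start, 1)[0]
    return qname


# The TCGA-style read names from the example above.
mate1 = "UNC14-SN744_125:6:2108:11845:164113/1"
mate2 = "UNC14-SN744_125:6:2108:11845:164113/2"
trimmed1 = trim_qname(mate1, suffix_start="/")
trimmed2 = trim_qname(mate2, suffix_start="/")
print(trimmed1, trimmed1 == trimmed2)
```

After trimming, both mates share one name, which is what the mate-matching procedure requires.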

## ALTERNATIVE WAY FOR STEP 3 WITH BASH-PYTHON (without fixing Qname) ##

This alternative method takes about 15 minutes per BAM file on one core, and each BAM file can be counted in parallel (e.g. 12 samples = 12 parallel batch jobs).

1. Save your txdb into a text file.

> write.table(as.data.frame(txdb), "mytxdb.txt", sep="\t", row.names=FALSE, col.names=FALSE, quote=FALSE)

A pre-compiled txdb for general purposes is already available: /proj/b2012036/Dhany/Sequgio/txdb_hg19.txt and /proj/b2012036/Dhany/Sequgio/txdb_grch.txt

2. Write the list of annotated exon-pairs.

This lets us ignore novel exon pairs in the counting step. Make sure you still have the txdb variable in R:

> ex_list <- split(values(txdb@unlistData)$exon_name,values(txdb@unlistData)$tx_name)
> reg_vec <- sapply(split(values(txdb@unlistData)$region_id,values(txdb@unlistData)$tx_name),function(x) x[1])
> sizes_ex_list <- sapply(ex_list,length)
> n.exons <- sum((sizes_ex_list^2+sizes_ex_list)/2)
> exons.names <- unique(.Call("makeExNames",ex_list,reg_vec,as.integer(n.exons)))
> write.table(exons.names, "exons19.txt", row.names=FALSE, col.names=FALSE, quote=FALSE, sep="\t")

A pre-compiled list of annotated exon-pairs for general purposes is already available: /proj/b2012036/Dhany/Sequgio/exons19.txt and /proj/b2012036/Dhany/Sequgio/exonsGrch.txt
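The expression (n^2+n)/2 used for n.exons in the R code is the number of unordered pairs, including an exon paired with itself, that can be formed from n exons. The Python sketch below checks this on made-up transcripts; how makeExNames actually names and orders the pairs is not shown here, so the tuples are purely illustrative.

```python
from itertools import combinations_with_replacement


def n_exon_pairs(n_exons):
    """Number of unordered exon pairs, self-pairs included: (n^2 + n) / 2."""
    return (n_exons ** 2 + n_exons) // 2


def exon_pairs(exons):
    """Enumerate unordered pairs of exons, self-pairs included."""
    return list(combinations_with_replacement(exons, 2))


# Hypothetical transcripts and their exons.
tx_exons = {"tx1": ["E1", "E2", "E3"], "tx2": ["E1", "E3"]}

# Upper bound on the number of names, summed over transcripts.
total = sum(n_exon_pairs(len(exons)) for exons in tx_exons.values())

# Pairs shared between transcripts are kept once, as with unique() in R.
unique_pairs = set()
for exons in tx_exons.values():
    unique_pairs.update(exon_pairs(exons))

print(total, len(unique_pairs))
```

The total is only an upper bound used for allocation; the list written to file is shorter whenever transcripts share exon pairs.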

3. Preview your merged BAM file.

Go to your bash command line, either by opening a new bash terminal or by quitting R. Find the pattern of the read headers (i.e. the first column) and answer the following questions:
(i) Can we tell which read is from pair 1 and which from pair 2? (0=no, 1=yes)
(ii) What is the delimiter that separates the header from the pair identifier?
(iii) Is the header on the left or right of the delimiter? (0=left, 1=right)


> samtools view -f 2 merged.bam | head
UNC17-SN1277:47:C0Y3YACXX:4:2205:18544:178539/1 81      chr1    10551   57      48M   ...
UNC17-SN1277:47:C0Y3YACXX:4:2208:7653:169627/2  163     chr1    11193   54      48M   ...

In the above example, (i) yes (=1) (ii) / (iii) left (=0). Thus, we define the “sep” parameter (separated by comma) as: sep=1,/,0

Beware that in some cases the pair identifier is not /1 and /2. You must decide whether there is a separator or not.
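One way to read the three-part “sep” value is sketched below in Python. This is an interpretation for illustration, not the code of the counting script: the function names are hypothetical, and the choice of splitting at the last delimiter when the header is on the left (and at the first when it is on the right) is an assumption.

```python
def parse_sep(sep):
    """Parse a 'flag,delimiter,side' string such as '1,/,0'.

    Slicing instead of splitting on ',' keeps a comma delimiter intact.
    """
    has_pair_id = sep[0] == "1"
    delimiter = sep[2:-2]
    header_side = int(sep[-1])  # 0 = header left of delimiter, 1 = right
    return has_pair_id, delimiter, header_side


def read_header(qname, sep):
    """Return the part of a read name shared by both mates."""
    has_pair_id, delimiter, header_side = parse_sep(sep)
    if not has_pair_id or delimiter not in qname:
        return qname
    if header_side == 0:
        return qname.rsplit(delimiter, 1)[0]
    return qname.split(delimiter, 1)[1]


qname = "UNC17-SN1277:47:C0Y3YACXX:4:2205:18544:178539/1"
header = read_header(qname, "1,/,0")
print(header)
```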

4. List which chromosomes you want to count.

First, you need to know the chromosome names used in the BAM files: 1,2,…,X,Y, or chr1,chr2,…,chrX,chrY. The names can be found in the third column of the BAM view (see above). Second, you need to state which chromosomes you want to work with, because the BAM file contains other chromosomes as well, e.g. chrM and patch/putative chromosomes. To include those other chromosomes, you need to check that your txdb recognizes them, too.

5. Run the Sequgio counting.

Run it in a BASH command line (still outside R). Specify the merged BAM file, chromosomes, txdb file, and output file. For more than one sample, it’s recommended to run in parallel (see Uppmax) rather than serially. Never run many samples outside a batch/interactive session! Here is an example of counting for 1 sample:


> cd /proj/b2012036/TCGA/BRCA_BAM_files/
> bash /proj/b2012036/Dhany/Sequgio/newsequgiocount.sh file=UNCID_1120271.5969d3bb-19fb-4bda-9483-db4791257b95.sorted_genome_alignments.bam sep=1,/,0 chr=chr1-22,chrX,chrY txdb=/proj/b2012036/Dhany/Sequgio/txdb_hg19.txt output=UNCID_1120271.5969d3bb-19fb-4bda-9483-db4791257b95.sorted_genome_alignments.bam.txt
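The chr= argument in the command above uses a compact form, chr1-22,chrX,chrY. The Python sketch below shows one plausible expansion of such a specification into individual chromosome names; it is an assumed interpretation, not the parsing code of the counting script, and the function name is hypothetical.

```python
import re


def expand_chromosomes(spec):
    """Expand e.g. 'chr1-22,chrX,chrY' into a list of chromosome names.

    Items of the form <prefix><first>-<last> become one name per number;
    all other items are kept unchanged.
    """
    names = []
    for item in spec.split(","):
        match = re.fullmatch(r"(\D*)(\d+)-(\d+)", item)
        if match:
            prefix = match.group(1)
            first, last = int(match.group(2)), int(match.group(3))
            names.extend("%s%d" % (prefix, i) for i in range(first, last + 1))
        else:
            names.append(item)
    return names


chromosomes = expand_chromosomes("chr1-22,chrX,chrY")
print(len(chromosomes), chromosomes[:3], chromosomes[-2:])
```

The same function handles BAM files without the chr prefix, e.g. `expand_chromosomes("1-22,X,Y")`.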

6. Join the counts of several samples

Several samples need to be counted so that you can continue to the next step. After joining the sample counts into 1 file, you can come back to R.

The syntax is: python makeAllcountMatrix.py [list of annotated exon-pairs (see no. 2)] [countfile_1] … [countfile_N] [samplename_1] … [samplename_N] > out.txt


> cd /proj/b2012036/TCGA/BRCA_BAM_files/
> python /proj/b2012036/Dhany/Sequgio/makeAllcountMatrix.py /proj/b2012036/Dhany/Sequgio/exons19.txt UNCID_1310380.5db1c37a-2329-4bd3-baf2-89d26a8959fa.sorted_genome_alignments.bam.txt UNCID_1311429.2efa4fd0-1c0b-49c8-a11f-606d8618ec97.sorted_genome_alignments.bam.txt UNCID_1313654.c14f7362-25c3-4e1a-acd7-871bb22ea8a3.sorted_genome_alignments.bam.txt UNCID_1314025.4543cecd-6333-473b-b5ce-96d33ab47706.sorted_genome_alignments.bam.txt UNCID_1338780.aaec0cbc-a493-48f4-9cc7-32629880ee24.sorted_genome_alignments.bam.txt UNCID_1338780.aaec0cbc-a493-48f4-9cc7-32629880ee24.sorted_genome_alignments.bam.txt UNCID_1367534.c7d2b685-5f11-4879-82d5-e983b5dbb07d.sorted_genome_alignments.bam.txt 1310380 1311429 1313654 1314025 1338780 1338780 1367534 > sample.counts
> allCounts
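Conceptually, the joining step builds a table with one row per annotated exon-pair and one column per sample. The Python sketch below shows the idea under stated assumptions: each per-sample count file is assumed to hold two tab-separated columns (exon-pair name, count), and the exon-pair names are invented. It is not the code of makeAllcountMatrix.py, whose file format is not described here.

```python
def parse_count_lines(lines):
    """Read 'name<TAB>count' lines into a dict (assumed file layout)."""
    counts = {}
    for line in lines:
        name, value = line.rstrip("\n").split("\t")
        counts[name] = int(value)
    return counts


def join_counts(annotated_pairs, sample_counts):
    """Build a header row plus one row per annotated exon-pair.

    sample_counts maps sample name -> {exon-pair name: count}. Pairs missing
    from a sample get 0; pairs absent from annotated_pairs are ignored.
    """
    samples = list(sample_counts)
    rows = [["pair"] + samples]
    for pair in annotated_pairs:
        rows.append([pair] + [sample_counts[s].get(pair, 0) for s in samples])
    return rows


# Hypothetical exon-pair names and counts for two samples.
annotated = ["E1.E1", "E1.E2", "E2.E2"]
s1 = parse_count_lines(["E1.E1\t5\n", "E1.E2\t2\n", "novel.pair\t9\n"])
s2 = parse_count_lines(["E1.E1\t3\n", "E2.E2\t7\n"])
table = join_counts(annotated, {"1310380": s1, "1311429": s2})
for row in table:
    print("\t".join(str(x) for x in row))
```

Restricting the rows to the annotated list is what drops novel exon pairs, as intended in step 2.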

3.4 Step 4: fitting models

Using the region(exons)-by-sample counts matrix (allCounts) and the design matrices object (Design) we can now fit models.

# model fitting
data(Counts)
data(Design)
iGenes <- names(Design) 
# Fit a single transcriptional unit (one element in Design) 
fit1 <- fitModels(iGenes[22],design=Design,counts=allCounts)
# More than one using list/for loops/mclapply/etc
fit2 <- lapply(iGenes[21:22],fitModels,design=Design,counts=allCounts)
# Fit all 
fitAll <- bplapply(iGenes, function(x){fitModels(x,design=Design,counts=allCounts)},BPPARAM=param)

Gene-level expression values can be obtained by adding up the isoform estimates within each gene. To do so, proceed as follows:



> sumByGene(fitAll,txdb)

where fitAll is a list containing estimates (output of fitModels) and txdb is either a data.frame or a GRangesList (as returned by reshapeTxDb). The required columns are “gene_id” and “tx_name”.
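The aggregation itself is a grouped sum over the tx_name to gene_id mapping. The Python sketch below shows the idea for a single sample with invented transcript names and values; it is not the sumByGene implementation.

```python
from collections import defaultdict


def sum_by_gene(isoform_estimates, tx_to_gene):
    """Add up isoform-level estimates within each gene."""
    totals = defaultdict(float)
    for tx_name, value in isoform_estimates.items():
        totals[tx_to_gene[tx_name]] += value
    return dict(totals)


# Hypothetical isoform estimates and annotation mapping.
estimates = {"tx1": 10.5, "tx2": 4.5, "tx3": 7.0}
tx_to_gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
gene_totals = sum_by_gene(estimates, tx_to_gene)
print(gene_totals)
```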

References

[1] Suo C, Calza S, Salim A and Pawitan Y (2014). Joint estimation of isoform expression and isoform-specific read distribution using multisample RNA-Seq data. Bioinformatics, 30, pages 506–513.