Circall

A fast and accurate methodology for discovery of circular RNAs from paired-end RNA-sequencing data

Contents

1. Introduction
2. Download and installation
3. Prepare BSJ reference database and annotation files
4. Indexing transcriptome and BSJ reference database
5. Run Circall pipeline
6. A practical copy-paste example of running Circall
7. Circall simulator

Update news

19 June 2020: version 0.1.0

  • First submission

1. Introduction

Circall is a novel method for fast and accurate discovery of circular RNAs from paired-end RNA-sequencing data. The method controls false positives by a two-dimensional local false discovery rate method and employs quasi-mapping for fast and accurate alignments. The details of Circall are described in its manuscript [Nguyen et al., 2021]. On this page, we present the Circall tool and how to use it.

Software requirements:

Circall is implemented in R and C++. We acknowledge materials from Sailfish, RapMap and other tools used in this software.

  • A C++11-compliant compiler (g++ >= 4.8.2)
  • R version 3.6.0 or later with the following packages installed: GenomicFeatures, Biostrings, foreach, and doParallel.

Annotation reference

Circall requires

  1. a fasta file of transcript sequences and a gtf file of transcript annotation: these can be downloaded from public repositories such as Ensembl (ensembl.org)
  2. a fasta file of the genome sequence: also available from public repositories such as Ensembl (ensembl.org)
  3. an RData file of supporting annotation: a description of how to create this RData file for new annotation versions or species is available in Section 3.

The current Circall version was tested on the human genome and transcriptome with Ensembl annotation version GRCh37.75; the specific files required are listed in Section 3.

Versions

The latest version and information of Circall are available at: https://www.meb.ki.se/sites/biostatwiki/circall/

2. Download and installation

If you use the binary version of Circall:

  • Download the latest binary version from the Circall website
wget --no-check-certificate -O Circall_v0.1.0_linux_x86-64.tar.gz https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/2021/04/Circall_v0.1.0_linux_x86-64.tar_.gz
  • Uncompress to folder
tar -xzvf Circall_v0.1.0_linux_x86-64.tar.gz
  • Move to the Circall_home directory and run the configuration for Circall
cd Circall_v0.1.0_linux_x86-64
bash config.sh
cd ..
  • Add paths of lib folder and bin folder to LD_LIBRARY_PATH and PATH
export LD_LIBRARY_PATH=/path/to/Circall_v0.1.0_linux_x86-64/linux/lib:$LD_LIBRARY_PATH
export PATH=/path/to/Circall_v0.1.0_linux_x86-64/linux/bin:$PATH
  • Do not forget to replace “/path/to/” with your local path, or use these commands to fill in your path automatically:
export LD_LIBRARY_PATH=$PWD/Circall_v0.1.0_linux_x86-64/linux/lib:$LD_LIBRARY_PATH
export PATH=$PWD/Circall_v0.1.0_linux_x86-64/linux/bin:$PATH
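
To check that the configuration worked, you can verify that the bundled binaries resolve from the shell. This is just a quick sanity check; TxIndexer and the Circall.sh wrapper are the entry points used in Sections 4 and 5:

# TxIndexer should resolve to a path under Circall_v0.1.0_linux_x86-64/linux/bin
command -v TxIndexer
# The pipeline wrapper script ships at the top level of the unpacked folder
ls $PWD/Circall_v0.1.0_linux_x86-64/Circall.sh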

If you want to build Circall from sources:

  • Download Circall from the Circall website and move to the Circall_home directory
wget --no-check-certificate -O Circall_v0.1.0.tar.gz https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/2021/04/Circall_v0.1.0.tar_.gz
tar -xzvf Circall_v0.1.0.tar.gz
cd Circall_v0.1.0
bash config.sh
  • Circall requires several build flags from Sailfish, including DFETCH_BOOST, DBOOST_ROOT, DTBB_INSTALL_DIR and DCMAKE_INSTALL_PREFIX. Please refer to the Sailfish website for more details on these flags.
  • Run the installation with the following command:
DBOOST_ROOT=/path/to/boostDir/ DTBB_INSTALL_DIR=/path/to/tbbDir/ DCMAKE_INSTALL_PREFIX=/path/to/Circall_home bash install.sh

After the installation is finished, remember to add the paths of the lib and bin folders to LD_LIBRARY_PATH and PATH:

export LD_LIBRARY_PATH=/path/to/Circall_home/lib:$LD_LIBRARY_PATH
export PATH=/path/to/Circall_home/bin:$PATH

Install Circall from sources in Ubuntu

##########################
### These scripts, in a copy-and-paste (line-by-line) manner, install Circall from the source code.
### The scripts have been successfully tested in Ubuntu 16, 19 and 20.

##########################
### download Circall
wget --no-check-certificate -O Circall_v0.1.0.tar.gz https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/2021/04/Circall_v0.1.0.tar_.gz
tar -xzvf Circall_v0.1.0.tar.gz
cd Circall_v0.1.0

#config to run Circall
bash config.sh

### install boost_1_58_0
wget http://sourceforge.net/projects/boost/files/boost/1.58.0/boost_1_58_0.tar.gz
tar -xvzf boost_1_58_0.tar.gz
cd boost_1_58_0

sudo apt-get update
sudo apt-get install build-essential g++ python-dev autotools-dev libicu-dev libbz2-dev libboost-all-dev
# (optional) use aptitude to browse the available boost packages
sudo apt-get install aptitude
aptitude search boost

./bootstrap.sh --prefix=boost_1_58_0_build
./b2
./b2 install

#The Boost C++ Libraries were successfully built!
#add the lib and folder to paths
export LD_LIBRARY_PATH=$PWD/boost_1_58_0_build/stage/lib:$LD_LIBRARY_PATH
export PATH=$PWD/boost_1_58_0_build:$PATH


### install tbb44_20160526oss
cd ..
wget https://www.threadingbuildingblocks.org/sites/default/files/software_releases/source/tbb44_20160526oss_src_0.tgz
tar xvf tbb44_20160526oss_src_0.tgz
sudo apt-get install libtbb-dev

### install cmake for ubuntu: cmake 3.5.1
sudo apt install cmake
### install curl
sudo apt install curl
### install autoconf
sudo apt-get install autoconf
### install zlib
sudo apt install zlib1g-dev
sudo apt install zlib1g
### update all installations
sudo apt-get update

### install Circall
DBOOST_ROOT=$PWD/boost_1_58_0/boost_1_58_0_build/ DTBB_INSTALL_DIR=$PWD/tbb44_20160526oss/ DCMAKE_INSTALL_PREFIX=Circall_0.1.0_build bash install.sh

#Circall_0.1.0 was successfully built!
###########

#add lib and bin folders to paths
export LD_LIBRARY_PATH=$PWD/Circall_0.1.0_build/lib:$LD_LIBRARY_PATH
export PATH=$PWD/Circall_0.1.0_build/bin:$PATH

#done
###########

3. Prepare BSJ reference database and annotation files

Download the genome fasta, transcript fasta and gtf annotation files:

wget http://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
wget http://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.75.cdna.all.fa.gz
gunzip Homo_sapiens.GRCh37.75.cdna.all.fa.gz
wget http://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
gunzip Homo_sapiens.GRCh37.75.gtf.gz

Create the Sqlite annotation

Rscript Circall_v0.1.0_linux_x86-64/R/createSqlite.R Homo_sapiens.GRCh37.75.gtf Homo_sapiens.GRCh37.75.sqlite

Create BSJ reference database

The BSJ reference database for Homo_sapiens.GRCh37.75 has been pre-built and can be downloaded as Homo_sapiens.GRCh37.75_BSJ_sequences.fa. It was generated by the following command:

Rscript Circall_v0.1.0_linux_x86-64/R/buildBSJdb.R gtfSqlite=Homo_sapiens.GRCh37.75.sqlite genomeFastaFile=Homo_sapiens.GRCh37.75.dna.primary_assembly.fa bsjDist=250 maxReadLen=150 output=Homo_sapiens.GRCh37.75_BSJ_sequences.fa

4. Indexing transcriptome and BSJ reference database

Index transcriptome

Circall_v0.1.0_linux_x86-64/linux/bin/TxIndexer -t Homo_sapiens.GRCh37.75.cdna.all.fa -o IndexTranscriptome

Index BSJ reference database

Circall_v0.1.0_linux_x86-64/linux/bin/TxIndexer -t Homo_sapiens.GRCh37.75_BSJ_sequences.fa -o IndexBSJ

Now all the annotation data are generated, and Circall is ready to run.

5. Run Circall pipeline

Suppose sample_01_1.fasta and sample_01_2.fasta are the input read files. For convenience, we have prepared a toy example to test the pipeline, which can be downloaded here:

wget https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/2021/07/sample_01_1.fasta_.gz
wget https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/2021/07/sample_01_2.fasta_.gz

Circall can be run in one command wrapped in a bash script:

bash Circall_v0.1.0_linux_x86-64/Circall.sh -genome Homo_sapiens.GRCh37.75.dna.primary_assembly.fa -gtfSqlite Homo_sapiens.GRCh37.75.sqlite -txFasta Homo_sapiens.GRCh37.75.cdna.all.fa -txIdx IndexTranscriptome -bsjIdx IndexBSJ -dep Circall_v0.1.0_linux_x86-64/Data/Circall_depdata_human.RData -read1 sample_01_1.fasta.gz -read2 sample_01_2.fasta.gz -p 4 -tag testing_sample -c FALSE -o Testing_out

Inputs and parameters

Annotation data:

  • genome — genome in fasta format
  • gtfSqlite — genome annotation in Sqlite format
  • txFasta — transcripts (cDNA) in fasta format
  • txIdx — quasi-index of txFasta
  • bsjIdx — quasi-index of BSJ reference fasta file

Input data:

  • read1 — input read1: should be in gz format
  • read2 — input read2: should be in gz format

Other parameters:

  • dep — data containing depleted circRNAs: specifies the null data (depleted circRNAs) for the two-dimensional local false discovery rate method. For convenience, we collected the null data from three human cell line datasets (HeLa, Hs68 and HEK293) and provide it in the tool: Circall_v0.1.0_linux_x86-64/Data/Circall_depdata_human.RData
  • p — the number of threads: Default is 4
  • tag — tag name of results: Default is “Sample”
  • td — generation of tandem sequences: TRUE/FALSE value, default is TRUE
  • c — clean intermediate data: TRUE/FALSE value, default is TRUE
  • o — output folder: Default is the current directory

Output

The main output of Circall is provided in *_Circall_final.txt. In this file, each row describes one circular RNA in 8 columns:

  • chr: chromosome
  • start: start position
  • end: end position
  • geneID: gene name that the circRNA belongs to
  • circID: the ID of circRNA in the format “chr__start__end”
  • junction_fragment_count: the number of fragment counts supporting the back-splicing-junction (BSJ)
  • median_circlen: the median length of the circular RNA
  • fdr: the false discovery rate computed from the two-dimensional local false discovery method
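
For downstream filtering, these columns can be processed with standard shell tools. The sketch below keeps circRNAs with at least 2 supporting BSJ fragments and fdr < 0.05; it assumes a tab-delimited file with a header row and the column order listed above, and the file name is derived from the -tag value used in the example above (verify both on your own output):

# Columns assumed: 1=chr 2=start 3=end 4=geneID 5=circID 6=junction_fragment_count 7=median_circlen 8=fdr
awk -F'\t' 'NR==1 || ($6 >= 2 && $8 < 0.05)' testing_sample_Circall_final.txt > circRNA_filtered.txt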

6. A practical copy-paste example of running Circall

In this section, we provide a practical example of using Circall in a copy-paste manner on an Hs68 cell line dataset.

Download and install Circall

  • Download the latest binary version from the Circall website
wget --no-check-certificate -O Circall_v0.1.0_linux_x86-64.tar.gz https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/2021/04/Circall_v0.1.0_linux_x86-64.tar_.gz
  • Uncompress to folder
tar -xzvf Circall_v0.1.0_linux_x86-64.tar.gz
  • Move to the Circall_home directory and run the configuration for Circall
cd Circall_v0.1.0_linux_x86-64
bash config.sh
cd ..
  • Add paths of lib folder and bin folder to LD_LIBRARY_PATH and PATH
export LD_LIBRARY_PATH=$PWD/Circall_v0.1.0_linux_x86-64/linux/lib:$LD_LIBRARY_PATH
export PATH=$PWD/Circall_v0.1.0_linux_x86-64/linux/bin:$PATH

Download the genome fasta, transcript fasta, pre-built BSJ database and annotation files.

# genome from ENSEMBL website
wget http://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz

# cDNA (transcript) and gene annotation (gtf) from the ENSEMBL website
wget http://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.75.cdna.all.fa.gz
gunzip Homo_sapiens.GRCh37.75.cdna.all.fa.gz
wget http://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
gunzip Homo_sapiens.GRCh37.75.gtf.gz

# pre-built BSJ databases from Circall website
wget https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/files/circall/Homo_sapiens.GRCh37.75_BSJ_sequences.fa.gz
gunzip Homo_sapiens.GRCh37.75_BSJ_sequences.fa.gz

# Generate Sqlite annotation
Rscript Circall_v0.1.0_linux_x86-64/R/createSqlite.R Homo_sapiens.GRCh37.75.gtf Homo_sapiens.GRCh37.75.sqlite

Index transcriptome

Circall_v0.1.0_linux_x86-64/linux/bin/TxIndexer -t Homo_sapiens.GRCh37.75.cdna.all.fa -o IndexTranscriptome

Index BSJ reference database

Circall_v0.1.0_linux_x86-64/linux/bin/TxIndexer -t Homo_sapiens.GRCh37.75_BSJ_sequences.fa -o IndexBSJ

Download Hs68 cell line RNA-seq data

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR444/SRR444975/SRR444975_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR444/SRR444975/SRR444975_2.fastq.gz

Run Circall

bash Circall_v0.1.0_linux_x86-64/Circall.sh -genome Homo_sapiens.GRCh37.75.dna.primary_assembly.fa -gtfSqlite Homo_sapiens.GRCh37.75.sqlite -txFasta Homo_sapiens.GRCh37.75.cdna.all.fa -txIdx IndexTranscriptome -bsjIdx IndexBSJ -dep Circall_v0.1.0_linux_x86-64/Data/Circall_depdata_human.RData -read1 SRR444975_1.fastq.gz -read2 SRR444975_2.fastq.gz -p 4 -tag testing_sample -o SRR444975

In our experience, it takes around 8 hours in total on a single CPU to complete.

7. Circall simulator

Introduction

Circall simulator is a tool integrated into Circall to generate RNA-seq data of both circRNAs and tandem RNAs. The source code is provided in R/Circall_simulator.R of the Circall tool. The main function of the simulator is Circall_simulator(), which can be run in the R console. This function requires the following parameters:

Parameter setting:

  • circInfo: a data frame with 6 columns: Chr, start_EXONSTART, end_EXONEND, GENEID, cCount and FPKM. Chr is the chromosome name, formatted as 1:22, X, Y, Mt. start_EXONSTART is the start position of the circRNA’s starting exon; end_EXONEND is the end position of its ending exon; GENEID is the ID of the gene containing the circRNA (used to get the gene model); cCount is the number of read pairs to generate for the target circRNA; and FPKM is the Fragments Per Kilobase of transcript per Million of the target circRNA. This data frame specifies the circular RNAs to simulate.
  • tandemInfo: a data frame similar to circInfo, used to simulate tandem RNAs. Set tandemInfo=NULL (the default value) to skip simulating tandem RNAs.
  • error_rate: sequencing error rate, the default value is 0.005
  • set.seed: seed for reproducibility, the default value is 2018
  • gtfSqlite: path to your annotation file in Sqlite format (generated by GenomicFeatures)
  • genomeFastaFile: path to your genome fasta file
  • txFastaFile: path to your transcript fasta file (cDNA)
  • out_name: prefix of the output folders, the default value is “Circall_simuation”
  • out_dir: the directory containing the output, the default value is the current directory
  • lib_size: expected library size, used when useFPKM=TRUE; the default value is NULL
  • useFPKM: boolean value indicating whether to use FPKM, the default value is FALSE. When useFPKM=TRUE, users need to set a value for lib_size, and the simulator will use the abundances in the FPKM column of circInfo/tandemInfo for the simulation.

A toy example for using Circall simulator

To illustrate the use of the Circall simulator, this section provides a toy example. Suppose your current working directory contains the installed Circall and the annotation data. First, load the functions of the simulator into your R console:

source("Circall_v0.1.0_linux_x86-64/R/Circall_simulator.R")

Then we create the objects circInfo and tandemInfo containing the information of the circRNAs and tandem RNAs:

Chr = c(7,7,3,5,17,4,7,1,3,1,17,12,14,10,18,17,5,20,16,17)

start_EXONSTART = c(131113792,99795401,172363413,179296769,36918664,151509200,2188787,51906019,57832924,225239153,76187051,111923075,104490906,101556854,196637,21075331,74981032,60712420,56903641,80730328)

end_EXONEND = c(131128461,99796580,172365904,179315312,36918758,151509336,2270359,51913807,57882659,225528403,76201599,111924628,104493276,101572901,199316,21087123,74998635,60716000,56904648,80772810)

GENEID = c("ENSG00000128585","ENSG00000066923","ENSG00000144959","ENSG00000197226","ENSG00000108294","ENSG00000198589","ENSG00000002822","ENSG00000085832","ENSG00000163681","ENSG00000185842","ENSG00000183077","ENSG00000204842","ENSG00000156414","ENSG00000023839","ENSG00000101557","ENSG00000109016","ENSG00000152359","ENSG00000101182","ENSG00000070915","ENSG00000141556")

set.seed(2021)
cCount = sample(2:2000,20)
FPKM = rep(0,20)

BSJ_info = data.frame(Chr = Chr, start_EXONSTART = start_EXONSTART, end_EXONEND = end_EXONEND, GENEID = GENEID, cCount = cCount, FPKM = FPKM)

circSet=c(1:15)
circInfo = BSJ_info[circSet,]
tandemInfo = BSJ_info[-circSet,]

Finally, we run the simulator:

simulation = Circall_simulator(circInfo = circInfo, tandemInfo = tandemInfo, useFPKM=FALSE, out_name = "Tutorial", gtfSqlite = "Homo_sapiens.GRCh37.75.sqlite", genomeFastaFile = "Homo_sapiens.GRCh37.75.dna.primary_assembly.fa", txFastaFile = "Homo_sapiens.GRCh37.75.cdna.all.fa", out_dir= "./simulation_test")

The output directory “./simulation_test” contains:

  • simulation_setting: the setting information of the simulation for both circRNAs and tandem RNAs
  • circRNA_data: RNA-seq data of the circRNAs
  • tandem_data: RNA-seq data of the tandem RNAs
  • fasta sequences of the tandem RNAs
  • fasta sequences of the circular RNAs

8. License

Circall is released under the GNU General Public License (GPL-3).

9. References

Nguyen, Dat Thanh, Quang Thinh Trac, Thi-Hau Nguyen, Ha-Nam Nguyen, Nir Ohad, Yudi Pawitan, and Trung Nghia Vu. 2021. “Circall: Fast and Accurate Methodology for Discovery of Circular RNAs from Paired-End RNA-Sequencing Data.” BMC Bioinformatics 22 (1): 495. https://doi.org/10.1186/s12859-021-04418-8.

XAEM

Contents

1. Introduction
2. Download and installation
3. XAEM: step by step instruction and explanation
3.1 Preparation for the annotation reference
3.2 Quantification of transcripts
4. A practical copy-paste example of running XAEM
5. Dataset for differential expression (DE) analysis

1. Introduction

This document shows how to use XAEM [Deng et al., 2019] to quantify isoform expression for multiple samples.

What’s new in version 0.1.2

  • Improved speed and fixed a bug in building CRPs to work with complex annotations such as GENCODE and Ensembl, which usually have >200,000 isoforms for hg38. The X matrix for human Ensembl GRCh38.95 can be downloaded here: X_matrix.RData

What’s new in version 0.1.1

  • Added standard errors for the estimates
  • Fixed a small bug when separating a CRP into more than one CRP due to H_thres
  • Fixed a small bug in the function crpcount() to avoid an error when there is only one CRP

Older versions

Software requirements for XAEM:

  • R version 3.3.0 or later with the packages foreach and doParallel installed
  • A C++11-compliant compiler (g++ >= 4.7)
  • XAEM is currently tested in a Linux environment

Annotation reference: XAEM requires a fasta file of transcript sequences and a gtf file of transcript annotation. XAEM supports any reference and annotation for any species.

The pre-built X-matrix for GRCh38.95 can be downloaded here: X_matrix.RData

In the XAEM paper, we use the UCSC hg19 annotation:

  • Download the sequences of transcripts: transcripts.fa.gz
  • Download the annotation of transcripts: genes_annotation.gtf.gz
  • Download the design matrix X of this annotation: X_matrix.RData (the X matrix is an essential object for bias correction and isoform quantification; see Section 3.1.2 for more details)
wget https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/XAEM_datasources/transcripts.fa.gz
gunzip transcripts.fa.gz
wget https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/XAEM_datasources/genes_annotation.gtf.gz
gunzip genes_annotation.gtf.gz
wget -O X_matrix.RData https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/2022/09/X_matrix.rdata --no-check-certificate

2. Download and installation

If you use the binary version of XAEM (recommended):

  • Download the latest binary version from the XAEM website:
wget https://github.com/WenjiangDeng/XAEM/releases/download/v0.1.2/XAEM-binary-0.1.2.tar.gz
  • Uncompress to folder
tar -xzvf XAEM-binary-0.1.2.tar.gz
  • Move to the XAEM_home directory and run the configuration for XAEM
cd XAEM-binary-0.1.2
bash configure.sh
  • Add paths of lib folder and bin folder to LD_LIBRARY_PATH and PATH
export LD_LIBRARY_PATH=/path/to/XAEM-binary-0.1.2/lib:$LD_LIBRARY_PATH
export PATH=/path/to/XAEM-binary-0.1.2/bin:$PATH
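
To confirm the setup, check that the XAEM executables resolve through the new PATH. This is just a quick sanity check; XAEM, TxIndexer and GenTC are the binaries used in Section 3:

# Each command should print a path under XAEM-binary-0.1.2/bin
command -v XAEM TxIndexer GenTC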

If you want to build XAEM from sources:

  • Download XAEM and move to the XAEM_home directory
wget https://github.com/WenjiangDeng/XAEM/releases/download/v0.1.2/XAEM-source-0.1.2.tar.gz
tar -xzvf XAEM-source-0.1.2.tar.gz
cd XAEM-source-0.1.2
bash configure.sh
  • XAEM requires several build flags from Sailfish, including DFETCH_BOOST, DBOOST_ROOT, DTBB_INSTALL_DIR and DCMAKE_INSTALL_PREFIX. Please refer to the Sailfish website for more details on these flags.
  • Run the installation with the following command:
DBOOST_ROOT=/path/to/boostDir/ DTBB_INSTALL_DIR=/path/to/tbbDir/ DCMAKE_INSTALL_PREFIX=/path/to/expectedBuildDir bash install.sh
  • After the installation is finished, remember to add the paths of the lib and bin folders to LD_LIBRARY_PATH and PATH
export LD_LIBRARY_PATH=/path/to/expectedBuildDir/lib:$LD_LIBRARY_PATH
export PATH=/path/to/expectedBuildDir/bin:$PATH

Do not forget to replace “/path/to/” with your local path.

3. XAEM: step by step instruction and explanation

XAEM mainly contains the following steps:

  • Preparation for the annotation reference: processes the transcript annotation to obtain the essential information for transcript quantification. This step includes 1) indexing the transcript sequences and 2) constructing the design matrix X.
  • Quantification of transcripts: takes the input from multiple RNA-seq samples, performs quasi-mapping, and generates the data for quantifying transcript expression. This step consists of 1) generating the equivalence class table; 2) creating the Y count matrix; and 3) estimating transcript expression using the AEM algorithm, which updates the X matrix and the transcript (isoform) expression.

3.1 Preparation for the annotation reference

3.1.1 Indexing transcripts

Use TxIndexer to index the transcript sequences in the reference file (transcripts.fa). For example:

wget https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/XAEM_datasources/transcripts.fa.gz
gunzip transcripts.fa.gz
TxIndexer -t /path/to/transcripts.fa -o /path/to/TxIndexer_idx

3.1.2 Construction of the X matrix (design matrix)

This step constructs the X matrix required by the XAEM pipeline. For users working with the human UCSC hg19 annotation, the X matrix can be downloaded here: X_matrix.rdata (rename the file to X_matrix.RData).

Given a file transcripts.fa containing the transcript sequences of an annotation reference, we construct the design matrix as follows.

  • a) Generate simulated RNA-seq data using the R package “polyester”
## R-packages of "polyester" and "Biostrings" are required
Rscript XAEM_home/R/genPolyesterSimulation.R /path/to/transcripts.fa /path/to/design_matrix
  • b) Run GenTC to generate transcript clusters (TC) using the simulated data. GenTC will generate an eqClass.txt file as the input for the next step.
GenTC -i /path/to/TxIndexer_idx -l IU -1 /path/to/design_matrix/sample_01_1.fasta -2 /path/to/design_matrix/sample_01_2.fasta -p 8 -o /path/to/design_matrix
  • c) Create the design matrix using buildCRP.R. The parameters of this script are as follows.
    • in: the input file (eqClass.txt) obtained from the previous step.
    • out: the output file name (*.RData) to which the design matrix will be saved.
    • H: (default H=0.025) the threshold to filter false-positive neighbors in each X matrix (see the XAEM paper, Section 2.2.1).
Rscript XAEM_home/R/buildCRP.R in=/path/to/design_matrix/eqClass.txt out=/path/to/design_matrix/X_matrix.RData H=0.025

3.2 Quantification of transcripts

Suppose we have already created a working directory “XAEM_project” (/path/to/XAEM_project/) for the quantification of transcripts.

3.2.1 Generating the equivalence class table

The command to generate the equivalence class table for each sample is similar to “sailfish quant”. For example, to run XAEM for sample1 and sample2 with 4 CPUs:

XAEM -i /path/to/TxIndexer_idx -l IU -1 s1_read1.fasta -2 s1_read2.fasta -p 4 -o /path/to/XAEM_project/sample1
XAEM -i /path/to/TxIndexer_idx -l IU -1 s2_read1.fasta -2 s2_read2.fasta -p 4 -o /path/to/XAEM_project/sample2
  • If the data are compressed in gz format, we can combine with gunzip to decompress on the fly:
XAEM -i /path/to/TxIndexer_idx -l IU -1 <(gunzip -c s1_read1.gz) -2 <(gunzip -c s1_read2.gz) -p 4 -o /path/to/XAEM_project/sample1
XAEM -i /path/to/TxIndexer_idx -l IU -1 <(gunzip -c s2_read1.gz) -2 <(gunzip -c s2_read2.gz) -p 4 -o /path/to/XAEM_project/sample2
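
When there are many samples, the per-sample commands can be wrapped in a simple shell loop. A minimal sketch, assuming gzipped read files named s<i>_read1.gz and s<i>_read2.gz as in the example above:

for i in 1 2; do
  XAEM -i /path/to/TxIndexer_idx -l IU \
       -1 <(gunzip -c s${i}_read1.gz) -2 <(gunzip -c s${i}_read2.gz) \
       -p 4 -o /path/to/XAEM_project/sample${i}
done
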
3.2.2 Creating Y count matrix

After running XAEM, we have the equivalence class tables of the multiple samples and can create the Y count matrix. For example, to run XAEM in parallel using 8 cores:

Rscript Create_count_matrix.R workdir=/path/to/XAEM_project core=8

3.2.3 Updating the X matrix and transcript expression using AEM algorithm

After constructing the Y count matrix, we use the AEM algorithm to update the X matrix. The updated X matrix is then used to estimate the transcript (isoform) expression. The command is as follows.

Rscript AEM_update_X_beta.R workdir=/path/to/XAEM_project core=8 design.matrix=X_matrix.RData isoform.out=XAEM_isoform_expression.RData paralog.out=XAEM_paralog_expression.RData merge.paralogs=FALSE isoform.method=average remove.ycount=TRUE

Parameter setting

  • workdir: the path to the working directory
  • core: the number of CPU cores for parallel computing
  • design.matrix: the path to the design matrix
  • isoform.out (default=XAEM_isoform_expression.RData): the output file containing the estimated expression of individual transcripts, where paralogs are split into separate isoforms. This file contains two objects, isoform_count and isoform_tpm, for the estimated counts and normalized values (TPM). The expression of the individual isoforms is calculated according to the setting of the parameter “isoform.method” below.
  • isoform.method (default=average): whether to report the expression of the individual members of a paralog set as the average or the total expression of the set (value=average/total).
  • paralog.out (default=XAEM_paralog_expression.RData): the output file containing the estimated expression of merged paralogs. This file consists of two objects, XAEM_count and XAEM_tpm, for the estimated counts and normalized values (TPM). The standard errors of the estimates are supplied in the object XAEM_se stored in *.standard_error.RData.
  • merge.paralogs (default=TRUE) (*): turns the paralog merging in XAEM on or off (value=TRUE/FALSE). Please see the note at the end of this section for details of how to use this parameter.
  • remove.ycount (default=TRUE): whether to clean all Ycount data after use.

The output of this step is saved in XAEM_isoform_expression.RData, which contains the TPM values and raw read counts of the multiple samples.
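
To take a quick look at the estimates from the shell, the RData file can be loaded through Rscript. A minimal sketch, assuming the object names isoform_count and isoform_tpm described under isoform.out above:

# Print the dimensions and the first rows of the estimated count matrix
Rscript -e 'load("/path/to/XAEM_project/XAEM_isoform_expression.RData"); print(dim(isoform_count)); print(head(isoform_count))'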

Note: (*) The XAEM pipeline provides the parameter merge.paralogs to merge, or not, the paralogs within the updated X matrix (please see the XAEM paper, Sections 2.2.3 and 2.3). Turning the paralog merging on (the default) produces a more accurate estimation; turning it off produces the same sets of isoforms across different projects.

4. A practical copy-paste example of running XAEM

This section presents a tutorial to run the XAEM pipeline on a toy example. Suppose the input data contain two RNA-seq samples and the server supplies 4 CPUs for computation. XAEM can then be tested by simply copying and pasting the example commands.

  • Download the binary version of XAEM and do configuration
# Create a working folder
mkdir XAEM_example
cd XAEM_example
# Download the binary version of XAEM
wget https://github.com/WenjiangDeng/XAEM/releases/download/v0.1.2/XAEM-binary-0.1.2.tar.gz

# Configure the tool
tar -xzvf XAEM-binary-0.1.2.tar.gz
cd XAEM-binary-0.1.2
bash configure.sh

# Add the paths to system
export LD_LIBRARY_PATH=$PWD/lib:$LD_LIBRARY_PATH
export PATH=$PWD/bin:$PATH
cd ..
  • Download annotation files and index the transcripts
## download annotation files
# Download the design matrix for the human UCSC hg19 annotation 
wget -O X_matrix.RData https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/2022/09/X_matrix.rdata --no-check-certificate

# Download the fasta of transcripts in the human UCSC hg19 annotation 
wget https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/XAEM_datasources/transcripts.fa.gz
gunzip transcripts.fa.gz

## Run XAEM indexer
TxIndexer -t transcripts.fa -o TxIndexer_idx
  • If using GRCh38.95, download the corresponding annotation files (Homo_sapiens.GRCh38.95.cdna.all.fa and Homo_sapiens.GRCh38.95.gtf) from Ensembl (http://jan2019.archive.ensembl.org/Homo_sapiens/Info/Index) and the X matrix of GRCh38.95 from here: https://www.dropbox.com/s/x6a693v1y7must0/X_matrix.RData
  • Download the RNA-seq data of two samples: sample1 and sample2
## Download input RNA-seq samples
# Create a XAEM project to save the data
mkdir XAEM_project
cd XAEM_project

# Download the RNA-seq data
wget https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/XAEM_datasources/sample1_read1.fasta.gz
wget https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/XAEM_datasources/sample1_read2.fasta.gz
wget https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/XAEM_datasources/sample2_read1.fasta.gz
wget https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/XAEM_datasources/sample2_read2.fasta.gz
cd ..
  • Generate the equivalence class tables for these samples
# Number of CPUs
CPUNUM=4

# Process for sample 1
XAEM -i TxIndexer_idx -l IU -1 <(gunzip -c XAEM_project/sample1_read1.fasta.gz) -2 <(gunzip -c XAEM_project/sample1_read2.fasta.gz) -p $CPUNUM -o XAEM_project/sample1

# Process for sample 2
XAEM -i TxIndexer_idx -l IU -1 <(gunzip -c XAEM_project/sample2_read1.fasta.gz) -2 <(gunzip -c XAEM_project/sample2_read2.fasta.gz) -p $CPUNUM -o XAEM_project/sample2
  • Create Y count matrix
# Note: R packages "foreach" and "doParallel" are required for parallel computing
Rscript $PWD/XAEM-binary-0.1.2/R/Create_count_matrix.R workdir=$PWD/XAEM_project core=$CPUNUM design.matrix=$PWD/X_matrix.RData
  • Estimate isoform expression using AEM algorithm
Rscript $PWD/XAEM-binary-0.1.2/R/AEM_update_X_beta.R workdir=$PWD/XAEM_project core=$CPUNUM design.matrix=$PWD/X_matrix.RData isoform.out=XAEM_isoform_expression.RData paralog.out=XAEM_paralog_expression.RData

The outputs are stored in the folder of “XAEM_project” including XAEM_isoform_expression.RData and XAEM_paralog_expression.RData.

5. Dataset for differential expression (DE) analysis

In the XAEM paper, we used RNA-seq data from the breast cancer cell line MDA-MB-231 for the DE analysis. Since the original data were generated by our collaborators and have not been published yet, we provide the equivalence class tables produced by the read aligner RapMap, which is the same mapper used by Salmon and is completely independent of the XAEM algorithm. We also provide the R scripts and a guide to replicate the DE analysis results in the paper.

In this section, we present instructions to download the data and run the scripts. The shell part of the pipeline follows the copy-paste manner, but the R scripts must be run in an R console.

5.1 Download the R-scripts and the design matrix

This step downloads the R scripts, changes directory to the folder containing them, and downloads the design matrix.

# Download R-scripts
wget https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/XAEM_datasources/brca_singlecell/RDR_brca_singlecell.zip
unzip RDR_brca_singlecell.zip
cd RDR_brca_singlecell

# Download the design matrix
wget -O X_matrix.RData https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/2022/09/X_matrix.rdata --no-check-certificate

5.2 Run XAEM from the equivalence class tables output by the read aligner RapMap

Download the equivalence class data

# Download the equivalence class tables of the single cells, which are the output of the read aligner RapMap

wget https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/XAEM_datasources/brca_singlecell/brca_singlecell_eqclassDir.zip
unzip brca_singlecell_eqclassDir.zip

Run XAEM on the equivalence class tables using the R code below. Note: this step takes about 2 hours on a personal computer with 4 CPUs. Users can consider skipping this step and downloading the available XAEM results for the downstream analysis.

# set the project path
projPath=getwd();
setwd(projPath)
source("collectDataOfXAEM.R")

If users want to download the available XAEM results:

# Download the available results of XAEM

wget https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/XAEM_datasources/brca_singlecell/XAEM_results.zip
unzip XAEM_results.zip

5.3 Differential-expression analysis of XAEM and other methods

Download the Cufflinks and Salmon data. These files contain the read-count data of the methods with and without bias correction.

# Download the results of cufflinks
wget https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/XAEM_datasources/brca_singlecell/cufflinks_results.zip
unzip cufflinks_results.zip

# Download the results of salmon
wget https://www.meb.ki.se/sites/biostatwiki/wp-content/uploads/sites/4/XAEM_datasources/brca_singlecell/salmon_results.zip
unzip salmon_results.zip

Run the code below in R to do the normalization and differential expression analysis.

# set the project path
projPath=getwd();
setwd(projPath)

# Normalize the data of three methods XAEM, Salmon and Cufflinks
source("Isoform_Expression_CPM_Normalization.R")

# Do DE analysis and plot figures
source("DEanalysis_plots.R")

# output: DE_Analysis.png

The results of the differential expression analysis (Figure 1 below) are the plots (DE_Analysis.png) reproducing Figure 3 of the XAEM paper. Note that due to the randomness of the 50 runs, the figure might differ slightly from the figure in the paper.

Figure 1. Detection and validation of differentially expressed (DE) isoforms using the MDA-MB-231 scRNA-seq dataset. XAEM, Salmon and Cufflinks are presented in blue-solid, red-dashed and grey-dotted lines, respectively. The x-axis shows the number of top DE isoforms in the training set; the y-axis is the proportion of rediscovery in the validation set. The rediscovery rate (RDR) is calculated by comparing the top 100, 500 and 1000 DE isoforms from the training set with all the significant DE isoforms from the validation set. The boxplots show the RDR from 50 runs. (a) Both the training set and the validation set are constructed using cells from batch 1. The quantification of XAEM, Salmon and Cufflinks is performed without bias correction. (b) The quantification from the three methods is bias-corrected. (c) The training set is constructed using cells from batch 1, while the validation set uses cells from batch 2. The RDR is calculated for only singleton isoforms. (d) The training set is constructed using cells from batch 1, and the validation set using cells from batch 2. The RDR is calculated using only non-paralogs.

References: 

  1. Deng, Wenjiang, Tian Mou, Nifang Niu, Liewei Wang, Yudi Pawitan, and Trung Nghia Vu. 2019. “Alternating EM Algorithm for a Bilinear Model in Isoform Quantification from RNA-Seq Data.” Bioinformatics.  https://doi.org/10.1093/bioinformatics/btz640.

Contents

  1. Introduction
  2. UPPMAX HPC systems
  3. Monitoring disk quota and core hour quota
  4. Interactive session
  5. Submitting batch jobs
  6. Using scratch
  7. Monitoring efficiency of batch and interactive jobs
  8. Questions

1. INTRODUCTION

UPPMAX (Uppsala Multidisciplinary Center for Advanced Computational Science) is Uppsala University’s high performance computing (HPC) system, which our group uses to run heavy computations requiring many processors and/or huge memory. If your computations are not intense, you can use your PC. To run less heavy computations, consider other HPC systems such as skjold. We try to follow the rules/suggestions below when running jobs on UPPMAX:

  1. Use the “core” partition and avoid the “node” partition
    There are three different cluster computing systems in UPPMAX: Milou, Tintin, and Halvan. Each system has several nodes (computer units) and each node has 16 cores (except for Halvan). Use the “node” partition only if you really intend to use exactly n*16 cores (n=1,2,…) from the specified node(s), or if you are reserving >=256 GB of memory for your computation job. Use the “node” partition sparingly; it is better to get approval from your supervisor before using it. In a normal situation, always use the “core” partition and specify the number of cores you use.
  2. Your project pays for every minute and every processor you use
    Each month your project budget provides 2000 core hours. So please be really considerate about the number of processors (cores) and the duration (walltime) when using UPPMAX. Make sure the number of cores does not exceed the requirement of your computation. Make sure the walltime is slightly larger than the real runtime: if you set 5 hours of walltime for a 6-hour job, you will hit the timeout and waste 5 core hours, times the number of cores you book, for no result at all. But if the walltime is too large, your job is put on very low priority.
  3. Use batch jobs and avoid interactive sessions where possible
    A batch job is the recommended way to run a computation requiring many processors: you submit a script file to the cluster. An interactive session allows you to skip writing a script file and to write and execute commands manually in a multi-processor environment. Interactive sessions should only be used for development, debugging, and testing. When you are not optimally using the processors (i.e. idle, editing scripts, or executing a 4-core computation when you reserved 8 cores) in an interactive session, you are wasting the project’s core hours. Be wise in reserving cores and walltime when you use an interactive session.
  4. Use the scratch directory when you produce huge temporary files
    The disk space for our project (/proj/b2012036) is limited. Use it only to store the output data and the frequently accessed input data. If your computational job produces gigantic temporary or intermediate files, write a script for that job that diverts the intermediate output files to the scratch directory, which has a very large disk space, and moves the final output files back to your project directory (see the sketch after this list).
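
The sketch below illustrates this scratch pattern inside a batch script. It assumes the node-local scratch directory is exposed as $SNIC_TMP (the UPPMAX convention; check the system documentation), and my_heavy_tool and the file names are placeholders:

# Copy the input to node-local scratch, compute there, copy the results back
cp /proj/b2012036/mydata/input.fastq.gz "$SNIC_TMP"/
cd "$SNIC_TMP"
my_heavy_tool input.fastq.gz -o results/     # huge intermediate files stay on scratch
cp -r results /proj/b2012036/myresults/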

2. UPPMAX HPC SYSTEMS

There are three HPC systems we can use in UPPMAX, depending on the computational resources you need.

  1. Tintin (tintin.uppmax.uu.se)
    Tintin has 144 nodes with 64 GB memory, 16 nodes with 128 GB, and 4 nodes with GPU graphics cards.
  2. Milou (milou.uppmax.uu.se)
    Milou has 174 nodes; each node has 128 GB, so each core is assigned 8 GB. Additionally, Milou has 17 extra “medium” (256 GB) and 17 “fat” (512 GB) nodes. To reserve the medium and fat nodes, a special parameter needs to be added. If you need more memory (up to 2 TB), use Halvan instead. Beware that code you compile on Tintin may not run on Milou, and vice versa, so it is better to stay on Milou/Tintin; otherwise you might need to recompile.
  3. Halvan (halvan.uppmax.uu.se)
    Halvan has 1 compute server, consisting of eight 8-core processors (= 64 cores) and 2 TB of shared memory, plus a very fast scratch file system.
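
For illustration, reserving one of the medium (256 GB) nodes would look roughly like this; the constraint name mem256GB is an assumption here and should be checked against the UPPMAX documentation (cf. the -C memXXXGB parameter in Section 5):

# book a whole medium (256 GB) node on Milou; constraint name assumed
sbatch -A b2012036 -p node -N 1 -t 10:00:00 -C mem256GB your_script_file.sh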

3. MONITORING DISK QUOTA & CORE HOUR QUOTA

It is important to regularly check the disk quota and core hour quota to see how close you are to the limits. These are the commands:

uquota: check disk quota
uquota

projinfo: check core hour quota

If your core hour usage exceeds the quota, all of your jobs will be put at a lower queue priority. So make sure you check the core hours regularly and plan your job submissions wisely.

projinfo

projplot -A <projnumber>: yields a plot of core hour usage over a period


du -skh */: checks which directories are filling up project space

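
If you check these regularly, it may help to wrap the commands above in a small convenience script (a sketch; the du path assumes our project directory):

#!/bin/bash
# one-shot status check: disk quota, core hour quota, and per-directory disk usage
uquota
projinfo
du -skh /proj/b2012036/*/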

4. INTERACTIVE SESSION

Remember that interactive sessions should only be used for development, debugging, and testing. Balancing high priority (i.e. getting your interactive session to start immediately) against core-hour usage is tricky: when you develop a program and want to debug or test it, you often need an interactive session with higher priority than the sbatch jobs. Here are some strategies to run interactive sessions efficiently:

  1. Use --qos=short for a 15-minute interactive job
    This gives you VERY high priority but very little time to debug your code, using up to 64 cores (4 nodes); a maximum of 2 qos-short jobs can run at the same time. This is of course only feasible for development and debugging, not for testing long-running code, but it is useful if you want to debug your code in a multicore environment when the queueing system is very crowded. The command is as follows:

interactive -A b2012036 -p core -n 4 -t 15:00 --qos=short
  2. Use -p devcore for a 1-hour interactive “core” job
    If 15 minutes isn’t enough and you prefer running a “core” job (e.g. you want to use 2 or 4 cores for 1 hour), then you can reserve another high priority job named “devcore”.

interactive -A b2012036 -p devcore -n 2 -t 1:00:00
  3. Use -p devel for a 1-hour interactive “node” job
    If 15 minutes isn’t enough for you and you prefer running a “node” job (e.g. you want to use 15 or 30 cores for 1 hour), then you can reserve another high priority job named “devel”:

interactive -A b2012036 -p devel -n 16 -t 1:00:00
  • If you don’t understand the difference between core and node, or if you don’t really need a whole node, use devcore instead of devel. So, what is the difference between “-p devcore -n 16” and “-p devel -n 16”? They are exactly the same.
  • If you REALLY need more than 1 hour of interactive session, use the usual parameters, i.e. -p core or -p node, according to your needs. In that case you have the same queueing priority as the sbatch jobs.
  • Reserve as short a walltime as possible and avoid writing code or sitting idle during an interactive session. Also, when you reserve several cores for an interactive session, always ask yourself: “Can the number of cores be reduced?” The “billing” starts when the interactive session starts, so when you are done but still have time left, exit immediately.

5. SUBMITTING BATCH JOBS

Once your scripts are guaranteed to work without errors on the node, run them as batch jobs. To make sure a script is ready for a batch job, try it as an interactive job before executing hundreds of similar jobs as batch jobs. Here is an example of a simple job submission.

sbatch -A b2012036 -p core -n 4 -t 1:00:00 -J jobname your_script_file.sh

In this example, you reserve 4 cores (-p core -n 4) for your_script_file.sh to run for at most 1 hour (-t 1:00:00), and you name your job “jobname”. If you specify a longer walltime (-t), you have a higher chance of finishing your job, since it might run longer than you expected. For instance, if this job finishes within 30 minutes, you use 2 core hours (4 * 0.5 = 2), which is good. But if the job actually needs 2 hours, it is stopped in the middle of its run when the walltime expires; you have wasted 4 core hours for nothing and have to resubmit the job with a longer walltime anyway. On the other hand, a job with a very long walltime is given lower priority than jobs with shorter walltimes. Therefore, to use UPPMAX efficiently you must set the walltime of your job “just over” its predicted duration.

The general command is sbatch [options] your_script_file.sh, where the options are space-separated job submission parameters and your_script_file.sh is the script you want to run.

There are some parameters you need to know in batch job submission:

      1. -p
        Always use “-p core” (i.e. a “core” job). Use “-p node” (i.e. a “node” job, booking one or more whole nodes) only when you want to run on more than one node with fewer than 16 cores per node. For example, if you want to run 30 cores as 16+14 on two nodes, write “-p core -n 30”, but if you want to run 15+15=30 cores, write “-p node -N 2 -n 30”. If you need 16 cores in one node for your job, just specify “-p core -n 16”. If you need to run a “node” job requiring 128 GB per node, you must use the “-C mem” option. If you need to run a “core” job requiring more than (8 * number_of_cores) GB, it is recommended to use Halvan, so that you do not waste core hours reserving bigger RAM.
      2. -n
        Number of cores
      3. -N
        Number of nodes
      4. -t
        Walltime, i.e. maximum duration of your job. The format is d-hh:mm:ss
      5. --mail-type=ALL
        You can choose between BEGIN (send email when the job begins), END (send email when the job finishes), FAIL (send email when the job fails), and ALL (send email on all of those occasions).
      6. --mail-user=your.email@ki.se
      7. -J
        Job name (max 8 characters), to remind yourself and to tell people which job is which.
      8. -C memXXXGB
        Custom RAM for your job. Remember that on Milou 1 core = 8 GB, except for the medium and fat nodes, which have up to 256 GB and up to 512 GB per node, respectively. XXX is the total memory requirement, i.e. the per-core requirement multiplied by the number of reserved cores/nodes.
      9. -C usage_mail
        If you specify this, you will get an email reporting how much resource that job used, so that the next time you run a similar job you can reserve the right number of cores and amount of memory. If you want to specify the memory and usage_mail at the same time, write the parameter as: -C “memXXXGB&usage_mail”. A submission combining several of these parameters is sketched after this list.
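
For instance, combining several of the parameters above into one submission (the walltime, job name, and email address are placeholders, and the mem256GB constraint name is an assumption, cf. the medium nodes in Section 2):

# 16-core "node" job on a medium (256 GB) node, with a job name,
# email notifications, and a resource usage report
sbatch -A b2012036 -p node -N 1 -t 1-00:00:00 -J mybigjob --mail-user=your.email@ki.se --mail-type=ALL -C "mem256GB&usage_mail" your_script_file.sh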

To see the list of jobs currently running or in the queue, type:

squeue -u yourusername

To cancel specific jobs, execute scancel, followed by the job ID retrieved from squeue:

scancel 123456

Or to cancel all jobs:

scancel -u yourusername

6. USING SCRATCH

Every time you are in an interactive or batch session, a large temporary directory in /scratch (named after your job ID) is created just for your session, and this directory is deleted when the session finishes. So, in your script you may create large temporary/intermediate files there, keep only the important files at the end, and discard the huge, unused temporary files. The environment variable $SNIC_TMP is set as an alias for your scratch folder, so you do not need to keep track of the directory name. And remember: as long as the results are still questionable or can change later, you MUST put them in /proj/b2012036/nobackup/, not in /proj/b2012036. A minimal skeleton of this pattern is sketched below, followed by a real example.
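
As a minimal sketch of the scratch pattern (the input/output paths and the program name are hypothetical):

#!/bin/bash -l
# copy the input to the fast, job-local scratch area
cp /proj/b2012036/nobackup/input/sample.bam $SNIC_TMP
cd $SNIC_TMP
# run the computation; temporary/intermediate files stay in scratch
your_program sample.bam > result.txt
# keep only the important output; scratch is wiped when the job ends
mv result.txt /proj/b2012036/nobackup/output/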

Here is an example of running SOMAC using 16 cores on Milou.

nobackup="/proj/b2012036/nobackup/mutations"
somac="/proj/b2012036/SOMAC"
dreamdata="/proj/b2012036/DREAM/Synthetic/Data"
mkdir -p $nobackup/dataset2_splitchr
mkdir -p $nobackup/dataset2_script
mkdir -p $nobackup/dataset2_result
eval sbatch -A b2012036 -p core -n 16 -t 5:00:00 --mail-user=dhany.saputra@ki.se --mail-type=ALL -J divide $somac/divide file1=$dreamdata/synthetic.challenge.set2.tumor.bam file2=$dreamdata/synthetic.challenge.set2.normal.bam chr=1-22,X,Y output=/proj/b2012036/nobackup/mutations/dataset2_splitchr
counter=0; sleep 5; while squeue -u dhany | grep -q 'divide'; do counter=`expr $counter + 5`; if [[ $(( $counter % 60 )) == 0 ]]; then echo "BAM splitting time elapsed = $(( $counter / 60 )) min"; fi; sleep 5; done; counter=`expr $counter + 5`; echo "Time elapsed for BAM splitting = $counter seconds"
cp $somac/config.cfg $nobackup/config.TN2
vi $nobackup/config.TN2 # Modify config.TN2 to use 8 cores (i.e. 8 for tumor file, 8 for normal file) and include $SNIC_TMP here
for i in {1,22,Y}
do
echo '#!/bin/bash -l' > $nobackup/dataset2_script/$i.txt
echo 'cp '$nobackup'/dataset2_splitchr/synthetic.challenge.set2.tumor.'$i'.bam $SNIC_TMP' >> $nobackup/dataset2_script/$i.txt
echo 'cp '$nobackup'/dataset2_splitchr/synthetic.challenge.set2.normal.'$i'.bam $SNIC_TMP' >> $nobackup/dataset2_script/$i.txt
echo 'cd '$somac >> $nobackup/dataset2_script/$i.txt
echo 'bash somac '$nobackup'/config.TN2 '$i >> $nobackup/dataset2_script/$i.txt
echo 'mv $SNIC_TMP/* '$nobackup'/dataset2_result' >> $nobackup/dataset2_script/$i.txt
eval sbatch -A b2012036 -p core -n 16 -t 24:00:00 --mail-user=dhany.saputra@ki.se --mail-type=ALL -J chr$i $nobackup/dataset2_script/$i.txt
done

These commands really depend on your case, so this is not a template, just an example. The important SOMAC commands are on lines 7 (run divide), 10 (edit the config), and 17 (run somac), but the rules regarding scratch, nobackup, and UPPMAX require us to write all 20 lines. Running a program in parallel on UPPMAX requires basic knowledge of Bash scripting: copying and moving files, using environment variables and arithmetic operators, looping, conditional if-else, and so on.

The full explanation of the different disk storage areas can be found in the UPPMAX disk storage guide.

7. MONITORING THE EFFICIENCY OF YOUR BATCH/INTERACTIVE JOBS

It is also important to review the batch jobs you have submitted, so that you can revise your scripts and job submission parameters to eventually utilize all N reserved cores at nearly 100% CPU usage each.

        1. To find out our project’s usage of UPPMAX resources in the last 30 days:
          finishedjobinfo b2012036

          This command reports the job ID, the job status (jobstate=COMPLETED/FAILED/TIMEOUT/CANCELLED), who submitted the job (username), the number of processors (procs), the partition (node/core), the maximum memory usage of the job (maxmemory_in_GiB), when the job was submitted (submit_time), the runtime, and the queuetime.

        2. To find out whether your batch/interactive job wastes the project’s core hours for nothing:
          less /sw/share/slurm/milou/uppmax_jobstats/*/yourjobid

          Every time you submit a batch/interactive job, you get a job ID. Replace “yourjobid” with that job ID to profile the CPU core usage of that job. See these two examples:
          [screenshot: per-core %CPU utilization of a 16-core job]
          The job in the example above uses 16 cores, so the last 16 columns are the %CPU utilization of each core. Each row shows the %CPU utilization after another 5 minutes. Here is another example, with 100% CPU usage on 1 core:
          [screenshot: 100% CPU usage on 1 core]
          What if your submitted multicore batch job shows nearly 100% in 2 columns and nearly 0% in the other columns? It means that your script only uses 2 cores. Then you might want to change your batch job submission parameters, or edit your script to keep the CPU cores more occupied.

        3. To look at the efficiency plots using jobstats tool
          In this section, we briefly introduce how to use the jobstats tool for our project. Details of how to use the tool are available at http://www.uppmax.uu.se/discovering-job-resource-usage-with-jobstats

          # connect to uppmax with Xforwarding
          $ ssh -Y username@milou.uppmax.uu.se
          
          # run the jobstats tool for the project
          $ jobstats -p -A b2012036
          
          # view the images
          $ eog milou-b2012036-*

          then use the arrow keys to browse job by job. An example image is shown below; it presents the usage of Tophat configured with 16 cores for parallel computing.

          An example of a core-hour usage plot

        4. To run Tophat efficiently for optimised core-hour usage
          As suggested by the UPPMAX support team, we should set 4 cores for each sample instead of 16. The runtime for an individual sample will be slightly longer, but the total time to finish all samples will be shorter. As we can see in the core-hour usage plot of the previous section, Tophat exploits parallel computing (16 cores) in only a few steps of its pipeline, which wastes a lot of core hours. Reducing 16 cores to 4 would let Tophat use only ~25% of the core hours. A sketch of submitting one 4-core job per sample follows.
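
          For instance, one could submit one 4-core Tophat job per sample instead of a single 16-core job; in this sketch the index and FASTQ paths are hypothetical (tophat’s -p option sets the number of threads):

          # submit one 4-core Tophat job per sample
          for sample in sampleA sampleB sampleC
          do
            echo '#!/bin/bash -l' > run_$sample.sh
            echo "tophat -p 4 -o ${sample}_out /path/to/hg19_index ${sample}_1.fastq ${sample}_2.fastq" >> run_$sample.sh
            sbatch -A b2012036 -p core -n 4 -t 10:00:00 -J th_$sample run_$sample.sh
          done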

8. TECHNICAL QUESTIONS

TrungNghia.Vu@ki.se