{"id":139,"date":"2014-06-03T10:07:46","date_gmt":"2014-06-03T10:07:46","guid":{"rendered":"https:\/\/jira-test.meb.ki.se\/wpsites\/biostatwiki\/?p=139"},"modified":"2016-08-11T13:07:03","modified_gmt":"2016-08-11T13:07:03","slug":"sequgio","status":"publish","type":"post","link":"https:\/\/www.meb.ki.se\/sites\/biostatwiki\/sequgio\/","title":{"rendered":"Sequgio"},"content":{"rendered":"<h2 class=\"titleHead\">An Introduction to Sequgio<\/h2>\n<div class=\"author\">Contents<\/div>\n<div class=\"tableofcontents\"><span class=\"sectionToc\">1 <a id=\"QQ2-1-2\" href=\"#x1-20001\">Introduction<\/a><\/span><\/div>\n<div class=\"tableofcontents\">2. Installing Sequgio<\/div>\n<div class=\"tableofcontents\"><span class=\"sectionToc\">3\u00a0<a id=\"QQ2-1-3\" href=\"#x1-30002\">Example data<\/a><\/span><\/div>\n<div class=\"tableofcontents\"><span class=\"subsectionToc\">3.1 <a id=\"QQ2-1-4\" href=\"#x1-40002.1\">Step 1: creating the annotation template<\/a>\u00a0(txdb)<\/span><\/div>\n<div class=\"tableofcontents\"><span class=\"subsectionToc\">3.2 <a id=\"QQ2-1-5\" href=\"#x1-50002.2\">Step 2: creating the design matrix object<\/a><\/span><\/div>\n<div class=\"tableofcontents\"><span class=\"subsectionToc\">3.3 <a id=\"QQ2-1-6\" href=\"#x1-60002.3\">Step 3: importing BAM files to create\u00a0<span class=\"aeti-10\">counts <\/span>matrix<\/a><\/span><\/div>\n<div class=\"tableofcontents\"><span class=\"subsectionToc\">3.4 <a id=\"QQ2-1-8\" href=\"#x1-80002.4\">Step 4: fiting models<\/a><\/span><\/div>\n<div class=\"tableofcontents\">4. \u00a0Other comments<\/div>\n<div class=\"tableofcontents\">4.1\u00a0<a id=\"QQ2-1-7\" href=\"#x1-70002.3.1\">Fixing QNAME<\/a><\/div>\n<div class=\"tableofcontents\">4.2 Alternative counting method<\/div>\n<div class=\"tableofcontents\"><\/div>\n<p><!--l. 47--><\/p>\n<h3 class=\"sectionHead\"><span class=\"titlemark\">1 <\/span> <a id=\"x1-20001\"><\/a>Introduction<\/h3>\n<p style=\"text-align: justify;\">This document provides a brief guide to the <strong><span class=\"aeti-10\">Sequgio <\/span>package<\/strong>, which is a package for gene isoform expression estimation, using a joint statistical model accounting for non-uniform isoform-specific read distribution (Suo et all. 2014).\u00a0There are <strong>four\u00a0components<\/strong> to this package:<\/p>\n<ul>\n<li class=\"indent\">constructing annotation template txdb<\/li>\n<li class=\"indent\">constructing design matrices used in expression estimation<\/li>\n<li class=\"indent\">importing BAM files to create counts matrix<\/li>\n<li class=\"indent\">estimating\u00a0the expression levels, and optionally the read distribution and standard error of the expression estimates<\/li>\n<\/ul>\n<p class=\"indent\" style=\"text-align: justify;\">Read intensity in the RNA-sequencing data is often not uniform, in which case the currently standard methods of gene and isoform expression estimates result in biased estimates. Sequgio accounts for non-uniform isoform-specific read distribution, yielding improved gene isoform expression estimation.\u00a0A statistical regularization with <span class=\"cmmi-10\">L<\/span><sub><span class=\"cmr-7\">1<\/span><\/sub> smoothing penalty is imposed to control the estimation. Also, for estimability reasons, the method uses information across samples from\u00a0the same gene (Suo et all. 2014).<\/p>\n<h3 class=\"sectionHead\"><span class=\"titlemark\">2\u00a0<\/span><a id=\"x1-20001\"><\/a>Installing Sequgio<\/h3>\n<p class=\"indent\" style=\"text-align: justify;\">Sequgio methods is implemented as an R package and will be soon available at bioconductor.org. In the meantime please use our\u00a0developer version found on <a href=\"https:\/\/github.com\/Senbee\/Sequgio\">GitHub<\/a><\/p>\n<p><!--\n\n\n<pre><code><strong>Obtaining Sequgio package from Bioconductor (soon to be released):<\/strong>\r\nFrom R execute below commands:\r\n&gt; source(\"http:\/\/bioconductor.org\/biocLite.R\")\r\n&gt; library(BiocInstaller)\r\n&gt; biocLite(\"Sequgio\")\r\n&gt; library(Sequgio) --> <strong>Obtaining Sequgio package from GitHub (available now: 2015\/01\/20)<\/strong> From R execute below commands\u00a0\r\n\r\n<pre><code>install.packages(\"devtools\")\r\nlibrary(devtools)\r\ndev_mode(on=T)\r\ninstall_github(\"Sequgio\",\"Senbee\",ref = \"Stable\")\r\ndev_mode(on=F) \r\n<\/code><\/pre>\n\n\n\n\n<h3 class=\"sectionHead\"><span class=\"titlemark\">3\u00a0<\/span><a id=\"x1-30002\"><\/a>Example data<\/h3>\n\n\n<!--l. 99--> We show\u00a0the functionality Sequgio package\u00a0using RNA sequencing samples provided in\u00a0the\u00a0<em>RNAseqData.HNRNPC.bam.chr14<\/em>. For demonstration purposes, Sequgio provides annotation template object (txdb, obtained in Step1) and design matrix (Design, obtained in Step2).<\/p>\n<pre><code>&gt;\u00a0library(RNAseqData.HNRNPC.bam.chr14)\r\n&gt; library(\"TxDb.Hsapiens.UCSC.hg19.knownGene\")\r\n&gt; data (\"TxDb\")\r\n&gt; data (\"Design\")\r\n\r\n<strong>For better performances the package supports parallel computing via the\u00a0<em>BiocParallel<\/em>\u00a0package which is loaded\u00a0automatically. For parallel processing set the parameters to the ones suiting your platform. We will use sequencial\u00a0computation here. \r\n<\/strong>&gt; param = SerialParam()\r\n<\/code><\/pre>\n<h4 class=\"subsectionHead\"><span class=\"titlemark\">3.1 <\/span> <a id=\"x1-40002.1\"><\/a>Step 1: creating annotation template (txdb)<\/h4>\n<p><!--l. 117--><\/p>\n<p class=\"noindent\" style=\"text-align: justify;\">The first step is providing\u00a0<em><span class=\"aett-10\">TranscriptDb <\/span><\/em>object that stores\u00a0transcript annotation information. Here, we will use the one provided by the package <span class=\"aeti-10\"><em>TxDb.Hsapiens.UCSC.hg19.knownGen<\/em>e<\/span>.<\/p>\n<p class=\"noindent\" style=\"text-align: justify;\">The <em><span class=\"aett-10\">TranscriptDb\u00a0<\/span><\/em>objects is then\u00a0preprocessed to generate <span class=\"aeti-10\">disjoint <\/span>regions using the <em><span class=\"aett-10\">reshapeTxDb <\/span><\/em>function. To obtain accurate results, set the read length parameter to match your experiment (72 nucleotides for the example data).<\/p>\n<p class=\"noindent\" style=\"text-align: justify;\">Further, since the example data\u00a0contain only\u00a0chromosome 14 we limit\u00a0<em><span class=\"aett-10\">TranscriptDb <\/span><\/em><span class=\"aett-10\">\u00a0to this chromosome as well<\/span><em><span class=\"aett-10\">\u00a0<\/span><\/em><span class=\"aett-10\">(<\/span>reducing computational burden).<\/p>\n<pre><code><strong># Example data<\/strong>\r\n&gt; seqs = seqnames(seqinfo(TxDb.Hsapiens.UCSC.hg19.knownGene))\r\n&gt;\u00a0sel\u00a0= \u00a0rep(FALSE,length(seqs))\r\n&gt;\u00a0names(sel) = \u00a0seqs\r\n&gt;\u00a0sel[\"chr14\"] =\u00a0TRUE\r\n&gt;\u00a0isActiveSeq(TxDb.Hsapiens.UCSC.hg19.knownGene) = \u00a0sel\r\n&gt;\u00a0txdb = reshapeTxDb(TxDb.Hsapiens.UCSC.hg19.knownGene,probelen =\u00a072L, mcpar=param)\r\n\r\n<strong># For the real data obtaining txdb could look like below\r\n<\/strong>&gt; txdb.ucsc = makeTranscriptDbFromGFF(file=\"pathway\/to\/genes.gft\", format=\"gtf\", dataSource=paste(http:\/\/cufflinks.cbcd.umd.edu\/igenomes.html\", sep=\"\")), species=\"hg19.2014-06-02-13-47-56\")\r\n&gt; seqs = seqnames(seqinfo(txdb.ucsc))\r\n&gt; sel = rep(TRUE, length(seqs))\r\n&gt; names(sel) = seas\r\n&gt; isActiveSeq(txdb.ucsc)=sel\r\n&gt; txdb = reshapeTxDb(txdb.ucss, probelen = 48L, mcpar=param_core)<\/code><\/pre>\n<p class=\"noindent\">This step has to be repeated only if the experimental parameters change, e.g. annotation database, reads length, with\/without junctions etc.<!--l. 153--><\/p>\n<p class=\"noindent\">If the user wants to focus on a subset of the &#8220;gene\/features&#8221; in the genome, this can be done using the argument <em>include.only<\/em>. This takes a GRanges object as input. The purpose is to limit resources usage at BAM files input: this will be restricted only to the specific regions supplied.<\/p>\n<pre><code>\r\n\r\n<em># Let's consider only two genes for example<\/em>\r\n\r\n&gt; GENES = genes(TxDb.Hsapiens.UCSC.hg19.knownGene)[c(1,3)]\r\n&gt; txdb = reshapeTxDb(TxDb.Hsapiens.UCSC.hg19.knownGene,probelen =\u00a072L, \r\n            include.only=GENES, mcpar=param)\r\n\r\n<\/code><\/pre>\n<p>&nbsp;<\/p>\n<p>The ranges can be also be supplied later in the pipeline (see <em>setCounts<\/em>).<\/p>\n<h4 class=\"subsectionHead\"><span class=\"titlemark\">3.2 <\/span> <a id=\"x1-50002.2\"><\/a>Step 2: creating design matrix<\/h4>\n<p><!--l. 156--><\/p>\n<p class=\"noindent\">The second step is to create design matrices for each \u201ctranscriptional unit\u201d (see references). These matrices will be\u00a0used in the fitting procedure. <!--l. 159--><\/p>\n<p class=\"indent\">Several parameters can be tuned. The main one regards which kind of library is the experiment using: <span class=\"aeti-10\">paired-end\u00a0<\/span>(\u201cPE\u201d) or <span class=\"aeti-10\">single-end <\/span>(\u201cSE\u201d, the default). This can be set with the <span class=\"aett-10\">method <\/span>argument. The required <span class=\"aett-10\"><em>mulen<\/em> <\/span>argument\u00a0provides an estimate of the average <span class=\"aeti-10\">fragment length<\/span>. <!--l. 165--><\/p>\n<p class=\"indent\">Here we will use paired-end data.<\/p>\n<pre><code>&gt; Design\u00a0&lt;-\u00a0makeXmatrix(txdb, method=\"PE\", mulen=155, sd=50, mcpar=param)\r\n<\/code><\/pre>\n<p class=\"indent\"><em>This step has to be only performed once, as step 1.<\/em><\/p>\n<h4 class=\"subsectionHead\"><span class=\"titlemark\">2.3 <\/span> <a id=\"x1-60002.3\"><\/a>Step 3: importing BAM files and create <span class=\"aeti-10\">counts\u00a0<\/span>matrix<\/h4>\n<p><!--l. 189--><\/p>\n<p class=\"noindent\">Models are fit based on a matrix with read counts for every <span class=\"aeti-10\">region <\/span>in every sample. We will now import the aligned\u00a0read counts in BAM files into <span class=\"aess-10\">R<\/span>as object called \u2019allCounts\u2019. To do so we\u00a0create a <span class=\"aett-10\">target <\/span>object storing\u00a0the filenames (with full path) and sample names to be used for count matrix headings. If BAI file are available, they\u00a0can be provided in the <span class=\"aett-10\">target <\/span>object. <!--l. 196--><\/p>\n<p class=\"indent\">The resulting object (<span class=\"aett-10\">allCounts<\/span>) will count for every exons the overlapping reads. <!--l. 199--><\/p>\n<p class=\"indent\">Let\u2019s first create the <span class=\"aett-10\">BamFileList <\/span>object pointing to the BAM files. Data are pair-end so we set <span class=\"aett-10\">asMates <\/span>to be\u00a0TRUE.<\/p>\n<pre><code>&gt;\u00a0bflist\u00a0&lt;-\u00a0BamFileList(RNAseqData.HNRNPC.bam.chr14_BAMFILES, asMates=TRUE)\r\n<\/code><\/pre>\n<p class=\"indent\">Then we create a <span class=\"aett-10\">seqCounts <\/span>object that will store some informations used in counting. A file name for file\u00a0backed matrix can be provided or a random one will be generated. Also the root directory where the files are stored\u00a0can be specified (defaults to working directory). <!--l. 213--><\/p>\n<p class=\"indent\">If the backing files (*.bin and *.desc) are then moved elsewhere, the files location informations in the <span class=\"aett-10\">seqCounts\u00a0<\/span>will have to be updated.<\/p>\n<p class=\"indent\">By default the function <span class=\"aett-10\">seqCounts\u00a0<\/span>will save the created object in the working directory, with the same fileName as *.bin and *.desc files. If this file is lost, it can be recreated allowing to link again to the *.bin file. To do specify <span class=\"aett-10\">seqCounts(&#8230;,existDesc=&#8221;fileName.desc&#8221;)<\/span>. A control on new exons\/samples length will be performed but not on sample and exon names.<\/p>\n<pre><code>&gt; seqObj\u00a0&lt;- setCounts(bflist,txdb,fileName=\"test\")\r\n<\/code><\/pre>\n<p class=\"noindent\">If the user would like to reduce I\/O resources usage when reading in BAM files, this can be restricted to specific region of interest specifying a set of ranges using <span class=\"aett-10\">includeOnly<\/span><\/p>\n<pre><code>\r\n<em> #This would show if any subset is already set<\/em>\r\n&gt; includeOnly(seqObj)\r\n<em> # Here we set it (see reshapeTxDb for GRanges definition example) <\/em>\r\n&gt; includeOnly(seqObj) = GENES\r\n<em> # Or if we would like to remove it <\/em>\r\n&gt; includeOnly(seqObj) = NULL\r\n<\/code><\/pre>\n<p class=\"indent\">We then perform the actual counting with the function <span class=\"aett-10\">doCounts <\/span>(no return value). This function saves the\u00a0counts in the shared\/filebacked <span class=\"aett-10\">big.matrix <\/span>object.<\/p>\n<pre><code>&gt;\u00a0doCounts(seqObj,mcpar=param)\r\n<\/code><\/pre>\n<p><!--l. 234--><\/p>\n<p class=\"indent\">We can finally import in RAM the counts creating a standard R\u00a0<span class=\"aett-10\">matrix<\/span>.<\/p>\n<pre><code>&gt; allCounts &lt;- getCounts(seqObj)<\/code><\/pre>\n<p class=\"indent\">In case the same <span class=\"aett-10\">big.matrix <\/span>is used for more counting it should be <span class=\"aeti-10\">reset <\/span>to have counts zero. <!--l. 245--><\/p>\n<p class=\"indent\">To do so use the <span class=\"aett-10\">resetCounts <\/span>function<\/p>\n<pre><code>&gt; resetCounts(seqObj)\r\n<\/code><\/pre>\n<p class=\"indent\">Again for ease of computation counts object is already provided.<\/p>\n<pre><code>&gt; data(\u201cCounts\u201d)\r\n<\/code><\/pre>\n<p class=\"indent\">We can see how many read we counted<\/p>\n<pre><code>&gt;\u00a0colSums(allCounts)\r\n<\/code><\/pre>\n<p class=\"indent\">The counting procedure can be performed on single BAM files too or a subset of a BamFileList<\/p>\n<p class=\"indent\">For a single file we can proceed as follows<\/p>\n<pre><code>&gt; bfl &lt;- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1],asMates=TRUE)\r\n&gt; seqObj1 &lt;- setCounts(bfl,txdb,fileName='test')\r\n<\/code><\/pre>\n<p class=\"indent\">and then perform the following steps as described. <!--l. 286--><\/p>\n<p class=\"indent\">For a subset of a <span class=\"aett-10\">seqCounts <\/span>we can proceed as described in the following code, using columns indicators\u00a0(<span class=\"aeti-10\">integers<\/span>)<\/p>\n<pre><code>&gt; doCounts(seqObj1,mcpar=param,which.sample=c(1L,4L))\r\n<\/code><\/pre>\n<p class=\"indent\">or similarly using sample names<\/p>\n<pre><code>&gt; doCounts(seqObj1,mcpar=param,which.sample=c('ERR127306','ERR127309'))\r\n<\/code><\/pre>\n<p class=\"indent\">This last procedure would allow to have different <span class=\"aeti-10\">nodes <\/span>of an HPC platform to count a subset of the samples i\u00a0parallel (each nodes get a different <span class=\"aett-10\">which.sample <\/span>vector), while each node can use multiple <span class=\"aeti-10\">cores <\/span>to count\u00a0individual samples. All the counts will be stored in the same shared matrix.<\/p>\n<p class=\"indent\">It also possible to subset a single sample (or many samples) based on genomic position. This can be achieved using <em>filterBam<\/em> function in <em>Rsamtools<\/em> package.<\/p>\n<p class=\"indent\">For example to read in the whole chr22 (in a human genome) we could proceed as follow<\/p>\n<pre><code>bfl\u00a0&lt;-\u00a0BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1],asMates=TRUE,\r\n               yieldSize=100000)\r\nseqObj\u00a0&lt;- setCounts(bfl,txdb,fileName=\"test\")\r\nlenChr22 = seqlengths(txdb)['chr22']\r\nbam.params = ScanBamParam(simpleCigar = FALSE, reverseComplement = FALSE, \r\n                what = c(\"qname\", \"qwidth\", \"mapq\"), \r\n                flag = scanBamFlag(isPaired = TRUE, isUnmappedQuery = FALSE, \r\n                isDuplicate = FALSE,isNotPassingQualityControls = FALSE, \r\n                hasUnmappedMate = FALSE),\r\n                which = RangesList(chr22=IRanges(1,lnChr22))))\r\n\r\ntmpFile = filterBam(bfl,tempfile(),param=bam.params)\r\ndoCounts(seqObj,mcpar=param)\r\n\r\n<\/code><\/pre>\n<p class=\"indent\"><!--l. 309--><\/p>\n<h5 class=\"subsubsectionHead\"><span class=\"titlemark\">2.3.1 <\/span> <a id=\"x1-70002.3.1\"><\/a>Fixing QNAME<\/h5>\n<p><!--l. 311--><\/p>\n<p class=\"noindent\">Some preprocessing pipelines deliver BAM files for paired-end reads where every mate has a slightly different\u00a0QNAME. E.g. using the SRA toolkit to convert to fastq (fastq-dump), it will generate out two fastq files, and the\u00a0QNAME in each of the fastq files will be appended with a \u201c.1\u201d for the first pairs and a \u201c.2\u201d for the second pairs.\u00a0Similarly we might find that pairs have a different prefix. <!--l. 318--><\/p>\n<p class=\"indent\">The matching procedure implemented in <span class=\"aeti-10\">Rsamtools <\/span>requires mates to have the same QNAME, so a trimming is\u00a0required. This can be achieved when declaring the <span class=\"aett-10\">BamFile <\/span>or <span class=\"aett-10\">BamFileList <\/span>object using the arguments\u00a0<span class=\"aett-10\">\u2019qnamePrefixEnd\u2019 <\/span>or <span class=\"aett-10\">\u2019qnameSuffixStart\u2019<\/span>. <!--l. 323--><\/p>\n<p class=\"indent\">Unique qnames aren\u2019t a problem in this sample file &#8211; just using it to demonstrate the usage. <!--l. 326--><\/p>\n<p class=\"indent\">First no trimming<\/p>\n<pre><code>&gt; fl &lt;- system.file(\"extdata\", \"ex1.bam\", package=\"Rsamtools\")\r\n&gt; param &lt;- ScanBamParam(what=\"qname\") \r\n&gt; bf &lt;- BamFile(fl, asMates=TRUE)\r\n&gt; scanBam(bf, param=param)[[1]]$qname[1:3]\r\n<\/code><\/pre>\n<p><!--l. 337--><\/p>\n<p class=\"indent\">then trim prefix<\/p>\n<pre><code>&gt; qnamePrefixEnd(bf) = \"_\"\r\n&gt; scanBam(bf, param=param)[[1]]$qname[1:3]\r\n<\/code><\/pre>\n<p><!--l. 346--><\/p>\n<p class=\"indent\">and now trim suffix also<\/p>\n<pre><code>&gt; qnameSuffixStart(bf) = \":\"\r\n&gt; scanBam(bf, param=param)[[1]]$qname[1:3]\r\n<\/code><\/pre>\n<div class=\"fancyvrb\">In real situation the easiest approach is to specify trimming character directly in the call to BamFile or BamFileList.<\/div>\n<div class=\"fancyvrb\"><\/div>\n<div class=\"fancyvrb\">For example in TCGA data (e.g. BRCA and LUNG) we have the following QNAME pattern for a pair of reads:<\/div>\n<div class=\"fancyvrb\">UNC14-SN744_125:6:2108:11845:164113\/1<\/div>\n<div class=\"fancyvrb\">UNC14-SN744_125:6:2108:11845:164113\/2<\/div>\n<div class=\"fancyvrb\">so we need to trim everything after (including) &#8220;\/&#8221;<\/div>\n<div class=\"fancyvrb\"><\/div>\n<div class=\"fancyvrb\">We would then define the BamFile object (or list if more than one) as follow<\/div>\n<pre><code>&gt; <\/code><code>bf &lt;- BamFile(Filename,asMates=TRUE,qnameSuffixStart=\"\/\") ## single file\r\n<\/code><code>&gt; <\/code><code>bfl &lt;- BamFileList(Filenames,asMates=TRUE,qnameSuffixStart=\"\/\") ## more than one BAM<\/code><\/pre>\n<p><!--l. 355--><\/p>\n<p class=\"noindent\"><strong>## ALTERNATIVE WAY FOR STEP 3 WITH BASH-PYTHON (without fixing Qname) ##<\/strong><\/p>\n<p class=\"noindent\">This alternative method takes ~15 min=1 core=1 BAM file, and each BAM file can be counted in parallel (e.g. 12 samples = 12 parallel batch jobs)<\/p>\n<p class=\"noindent\">1. Save your txdb into a text file.<\/p>\n<pre><code>&gt; write.table(as.data.frame(txdb), \"mytxdb.txt\", sep=\"\\t\", row.names=FALSE, col.names=FALSE, quote=FALSE)\r\n<\/code><\/pre>\n<p>We already have pre-compiled TXDB for general purpose: \/proj\/b2012036\/Dhany\/Sequgio\/txdb_hg19.txt and \/proj\/b2012036\/Dhany\/Sequgio\/txdb_grch.txt<\/p>\n<p class=\"noindent\">2. Write the list of annotated exon-pairs.<\/p>\n<p>Do this, so that we can ignore novel exon pairs in this counting step. Make sure you still have the txdb variable in R:<\/p>\n<pre>&gt; ex_list &lt;- split(values(txdb@unlistData)$exon_name,values(txdb@unlistData)$tx_name)\r\n&gt; reg_vec &lt;- sapply(split(values(txdb@unlistData)$region_id,values(txdb@unlistData)$tx_name),function(x) x[1])\r\n&gt; sizes_ex_list &lt;- sapply(ex_list,length)\r\n&gt; n.exons &lt;- sum((sizes_ex_list^2+sizes_ex_list)\/2)\r\n&gt; exons.names &lt;- unique(.Call(\"makeExNames\",ex_list,reg_vec,as.integer(n.exons)))\r\n&gt; write.table(exons.names, \"exons19.txt\", row.names=FALSE, col.names=FALSE, quote=FALSE, sep=\"\\t\")<\/pre>\n<p>We already have pre-compiled list of annotated exon-pairs for general purpose: \/proj\/b2012036\/Dhany\/Sequgio\/exons19.txt and \/proj\/b2012036\/Dhany\/Sequgio\/exonsGrch.txt<\/p>\n<p class=\"noindent\">3. Preview your merged BAM file.<\/p>\n<p>Go to your bash command line, either open a new bash terminal or quit R. Find the pattern for the headers (i.e. first column) and answer the following questions:<br \/>\n(i) Can we tell which one is from pair 1 and 2? (0=no, 1=yes)<br \/>\n(ii) What&#8217;s the delimiter that separate headers and pair identifier?<br \/>\n(iii) Is the header on the left or right of the delimiter? (0=left, 1=right)<\/p>\n<pre><code>\r\n&gt; samtools view -f 2 merged.bam | head\r\nUNC17-SN1277:47:C0Y3YACXX:4:2205:18544:178539\/1 81      chr1    10551   57      48M   ...\r\nUNC17-SN1277:47:C0Y3YACXX:4:2208:7653:169627\/2  163     chr1    11193   54      48M   ...\r\n<\/code><\/pre>\n<p>In the above example, (i) yes (=1) (ii) \/ (iii) left (=0). Thus, we define the &#8220;sep&#8221; parameter (separated by comma) as: sep=1,\/,0<\/p>\n<p>Beware that in some cases, the pair identifier is not \/1 and \/2. You must decide whether there&#8217;s separator or not.<\/p>\n<p class=\"noindent\">4. List which chromosomes you want to count.<\/p>\n<p>First, you need to know the chromosome names of the BAM files: 1,2,&#8230;,X,Y, or chr1,chr2,&#8230;,chrX,chrY. The names can be found on the second column of the BAM view (see above). Second, you need to mention which chromosomes you want to work with because there are other chromosomes in the BAM file, e.g. chrM, patch\/putative chromosomes. To include those other chromosomes, you need to check whether your txdb recognizes those other chromosomes, too.<\/p>\n<p class=\"noindent\">5. Run the Sequgio counting.<\/p>\n<p>Run it in a BASH command line (still outside R). Specify the merged BAM file, chromosomes, txdb file, and output file. For &gt;1 sample, it&#8217;s recommended to run in parallel (see <a href=\"https:\/\/www.meb.ki.se\/sites\/biostatwiki\/uppmax\/\">Uppmax<\/a>), not in serial. Never run many samples without batch\/interactive session! Here is an example of counting for 1 sample:<\/p>\n<pre><code>\r\n&gt; cd \/proj\/b2012036\/TCGA\/BRCA_BAM_files\/\r\n&gt; bash \/proj\/b2012036\/Dhany\/Sequgio\/newsequgiocount.sh file=UNCID_1120271.5969d3bb-19fb-4bda-9483-db4791257b95.sorted_genome_alignments.bam sep=1,\/,0 chr=chr1-22,chrX,chrY txdb=\/proj\/b2012036\/Dhany\/Sequgio\/txdb_hg19.txt output=UNCID_1120271.5969d3bb-19fb-4bda-9483-db4791257b95.sorted_genome_alignments.bam.txt\r\n<\/code><\/pre>\n<p class=\"noindent\">6. Join the counts of several samples<\/p>\n<p>Several samples need to be counted so that you can continue to the next step. After joining the sample counts into 1 file, you can come back to R.<\/p>\n<p>The syntax is: python makeAllcountMatrix.py [list of annotated exon-pairs (see no. 2)] [countfile_1] &#8230; [countfile_N] [samplename_1] &#8230; [samplename_N] &gt; out.txt<\/p>\n<pre><code>\r\n&gt; cd \/proj\/b2012036\/TCGA\/BRCA_BAM_files\/\r\n&gt; python \/proj\/b2012036\/Dhany\/Sequgio\/makeAllcountMatrix.py \/proj\/b2012036\/Dhany\/Sequgio\/exons19.txt UNCID_1310380.5db1c37a-2329-4bd3-baf2-89d26a8959fa.sorted_genome_alignments.bam.txt UNCID_1311429.2efa4fd0-1c0b-49c8-a11f-606d8618ec97.sorted_genome_alignments.bam.txt UNCID_1313654.c14f7362-25c3-4e1a-acd7-871bb22ea8a3.sorted_genome_alignments.bam.txt UNCID_1314025.4543cecd-6333-473b-b5ce-96d33ab47706.sorted_genome_alignments.bam.txt UNCID_1338780.aaec0cbc-a493-48f4-9cc7-32629880ee24.sorted_genome_alignments.bam.txt UNCID_1338780.aaec0cbc-a493-48f4-9cc7-32629880ee24.sorted_genome_alignments.bam.txt UNCID_1367534.c7d2b685-5f11-4879-82d5-e983b5dbb07d.sorted_genome_alignments.bam.txt 1310380 1311429 1313654 1314025 1338780 1338780 1367534 &gt; sample.counts\r\n&gt; allCounts <\/code><\/pre>\n<h4 class=\"subsectionHead\"><span class=\"titlemark\">2.4 <\/span> <a id=\"x1-80002.4\"><\/a>Step 4: fit models<\/h4>\n<p><!--l. 358--><\/p>\n<p class=\"noindent\">Using the region(exons)-by-sample counts matrix (<span class=\"aett-10\">allCounts<\/span>) and the design matrices object (<span class=\"aett-10\">Design<\/span>) we can now\u00a0fit models. <!--l. 362--><\/p>\n<pre># model fitting\r\ndata(Counts)\r\ndata(Design)\r\niGenes &lt;- names(Design) \r\n# Fit a single transcriptional unit (one element in Design) \r\nfit1 &lt;- fitModels(iGenes[22],design=Design,counts=allCounts)\r\n# More than one using list\/for loops\/mclapply\/etc\r\nfit2 &lt;- lapply(iGenes[21:22],fitModels,design=Design,counts=allCounts)\r\n# Fit all \r\nfitAll&lt;-bplapply(iGenes, function(x){fitModels(x,design=Design,counts=allCounts)},BPPARAM=param)\r\n})<\/pre>\n<p>It is possible to get gene-level expression values adding up isoform estimates within gene. To do so proceed as follow<\/p>\n<pre><code>\r\n\r\n&gt; sumByGene(fitAll,txdb)\r\n\r\n<\/code><\/pre>\n<p>where aFit is a list containing estimates (output of fitModels) and txdb is either a data.frame or a GRangesList (as returned by reshapeTxDb). The required columns are &#8220;gene_id&#8221; and &#8220;tx_name&#8221;.<\/p>\n<h3 class=\"likesectionHead\"><a id=\"x1-90002.4\"><\/a>References<\/h3>\n<p class=\"noindent\"><!--l. 374--><\/p>\n<p class=\"noindent\">1]<a id=\"XSC\"><\/a>Suo C, Calza S, Salim A and Pawitan Y (2014). Joint estimation of isoform expression and isoform-specific read distribution using multisample RNA-Seq data. <span class=\"aeti-10\">Bioinformatics<\/span>, 30, pages 506\u2013513<\/p>\n","protected":false},"excerpt":{"rendered":"<p>An Introduction to Sequgio Contents 1 Introduction 2. Installing Sequgio 3\u00a0Example data 3.1 Step 1: creating the annotation template\u00a0(txdb) 3.2 Step 2: creating the design matrix object 3.3 Step 3: importing BAM files to create\u00a0counts matrix 3.4 Step 4: fiting models 4. \u00a0Other comments 4.1\u00a0Fixing QNAME 4.2 Alternative counting method 1 Introduction This document provides [&hellip;]<\/p>\n","protected":false},"author":8,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4],"tags":[3],"class_list":["post-139","post","type-post","status-publish","format-standard","hentry","category-rna-seq","tag-rnaseq"],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/www.meb.ki.se\/sites\/biostatwiki\/wp-json\/wp\/v2\/posts\/139","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.meb.ki.se\/sites\/biostatwiki\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.meb.ki.se\/sites\/biostatwiki\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.meb.ki.se\/sites\/biostatwiki\/wp-json\/wp\/v2\/users\/8"}],"replies":[{"embeddable":true,"href":"https:\/\/www.meb.ki.se\/sites\/biostatwiki\/wp-json\/wp\/v2\/comments?post=139"}],"version-history":[{"count":96,"href":"https:\/\/www.meb.ki.se\/sites\/biostatwiki\/wp-json\/wp\/v2\/posts\/139\/revisions"}],"predecessor-version":[{"id":729,"href":"https:\/\/www.meb.ki.se\/sites\/biostatwiki\/wp-json\/wp\/v2\/posts\/139\/revisions\/729"}],"wp:attachment":[{"href":"https:\/\/www.meb.ki.se\/sites\/biostatwiki\/wp-json\/wp\/v2\/media?parent=139"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.meb.ki.se\/sites\/biostatwiki\/wp-json\/wp\/v2\/categories?post=139"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.meb.ki.se\/sites\/biostatwiki\/wp-json\/wp\/v2\/tags?post=139"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}