Run Bwa Aln With Paired End Reads
Provided by: bwa_0.7.5a-2_amd64
NAME
bwa - Burrows-Wheeler Alignment Tool
SYNOPSIS
bwa index ref.fa
bwa mem ref.fa reads.fq > aln-se.sam
bwa mem ref.fa read1.fq read2.fq > aln-pe.sam
bwa aln ref.fa short_read.fq > aln_sa.sai
bwa samse ref.fa aln_sa.sai short_read.fq > aln-se.sam
bwa sampe ref.fa aln_sa1.sai aln_sa2.sai read1.fq read2.fq > aln-pe.sam
bwa bwasw ref.fa long_read.fq > aln.sam
DESCRIPTION
BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. The first algorithm is designed for Illumina sequence reads up to 100bp, while the other two are for longer sequences ranging from 70bp to 1Mbp. BWA-MEM and BWA-SW share similar features such as long-read support and split alignment, but BWA-MEM, which is the latest, is generally recommended for high-quality queries as it is faster and more accurate. BWA-MEM also has better performance than BWA-backtrack for 70-100bp Illumina reads.

For all the algorithms, BWA first needs to construct the FM-index for the reference genome (the index command). Alignment algorithms are invoked with different sub-commands: aln/samse/sampe for BWA-backtrack, bwasw for BWA-SW and mem for the BWA-MEM algorithm.
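The index-then-align workflow described above, applied to paired-end reads with BWA-backtrack, can be sketched as a small Python helper that only assembles the command lines. It uses nothing beyond the sub-commands and argument orders shown in the SYNOPSIS. The function names and the out_prefix parameter are illustrative assumptions, not part of BWA, and nothing is executed.

```python
import shlex


def backtrack_pe_commands(ref, fq1, fq2, out_prefix="aln"):
    """Return (argv, stdout_path) steps for a paired-end BWA-backtrack run.

    Hypothetical helper: it only mirrors the command lines in the SYNOPSIS.
    stdout_path is None when the step does not write to stdout (bwa index).
    """
    sai1 = f"{out_prefix}_sa1.sai"
    sai2 = f"{out_prefix}_sa2.sai"
    sam = f"{out_prefix}-pe.sam"
    return [
        (["bwa", "index", ref], None),                       # build the FM-index once
        (["bwa", "aln", ref, fq1], sai1),                    # SA coordinates, first ends
        (["bwa", "aln", ref, fq2], sai2),                    # SA coordinates, second ends
        (["bwa", "sampe", ref, sai1, sai2, fq1, fq2], sam),  # pair and write SAM
    ]


def as_shell_lines(steps):
    """Render the steps as shell command lines with stdout redirection."""
    lines = []
    for argv, stdout_path in steps:
        line = shlex.join(argv)
        if stdout_path is not None:
            line += " > " + shlex.quote(stdout_path)
        lines.append(line)
    return lines


if __name__ == "__main__":
    for line in as_shell_lines(
        backtrack_pe_commands("ref.fa", "read1.fq", "read2.fq")
    ):
        print(line)
```

To actually run the steps, each argv could be passed to subprocess.run with stdout opened on the given path; that requires a bwa executable on PATH and is left out of the sketch.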
COMMANDS AND OPTIONS
index
    bwa index [-p prefix] [-a algoType] db.fa

    Index database sequences in the FASTA format.

    OPTIONS:
    -p STR    Prefix of the output database [same as db filename]
    -a STR    Algorithm for constructing BWT index. BWA implements two algorithms for BWT construction: is and bwtsw. The first algorithm is a little faster for small databases but requires large RAM and does not work for databases with total length longer than 2GB. The second algorithm is adapted from the BWT-SW source code. It in theory works with databases with trillions of bases. When this option is not specified, the appropriate algorithm will be chosen automatically.

mem
    bwa mem [-aCHMpP] [-t nThreads] [-k minSeedLen] [-w bandWidth] [-d zDropoff] [-r seedSplitRatio] [-c maxOcc] [-A matchScore] [-B mmPenalty] [-O gapOpenPen] [-E gapExtPen] [-L clipPen] [-U unpairPen] [-R RGline] [-v verboseLevel] db.prefix reads.fq [mates.fq]

    Align 70bp-1Mbp query sequences with the BWA-MEM algorithm. Briefly, the algorithm works by seeding alignments with maximal exact matches (MEMs) and then extending seeds with the affine-gap Smith-Waterman algorithm (SW).

    If the mates.fq file is absent and option -p is not set, this command treats the input reads as single-end. If mates.fq is present, this command assumes the i-th read in reads.fq and the i-th read in mates.fq constitute a read pair. If -p is used, the command assumes the 2i-th and the (2i+1)-th read in reads.fq constitute a read pair (such an input file is said to be interleaved). In this case, mates.fq is ignored. In the paired-end mode, the mem command will infer the read orientation and the insert size distribution from a batch of reads.

    The BWA-MEM algorithm performs local alignment. It may produce multiple primary alignments for different parts of a query sequence. This is a crucial feature for long sequences. However, some tools such as Picard's markDuplicates do not work with split alignments. One may consider using option -M to flag shorter split hits as secondary.

    OPTIONS:
    -t INT    Number of threads [1]
    -k INT    Minimum seed length. Matches shorter than INT will be missed. The alignment speed is usually insensitive to this value unless it significantly deviates from 20. [19]
    -w INT    Band width. Essentially, gaps longer than INT will not be found. Note that the maximum gap length is also affected by the scoring matrix and the hit length, not solely determined by this option. [100]
    -d INT    Off-diagonal X-dropoff (Z-dropoff). Stop extension when the difference between the best and the current extension score is above |i-j|*A+INT, where i and j are the current positions of the query and reference, respectively, and A is the matching score. Z-dropoff is similar to BLAST's X-dropoff except that it doesn't penalize gaps in one of the sequences in the alignment. Z-dropoff not only avoids unnecessary extension, but also reduces poor alignments inside a long good alignment. [100]
    -r FLOAT  Trigger re-seeding for a MEM longer than minSeedLen*FLOAT. This is a key heuristic parameter for tuning the performance. A larger value yields fewer seeds, which leads to faster alignment speed but lower accuracy. [1.5]
    -c INT    Discard a MEM if it has more than INT occurrences in the genome. This is an insensitive parameter. [10000]
    -P        In the paired-end mode, perform SW to rescue missing hits only but do not try to find hits that fit a proper pair.
    -A INT    Matching score. [1]
    -B INT    Mismatch penalty. The sequence error rate is approximately: {.75 * exp[-log(4) * B/A]}. [4]
    -O INT    Gap open penalty. [6]
    -E INT    Gap extension penalty. A gap of length k costs O + k*E (i.e. -O is for opening a zero-length gap). [1]
    -L INT    Clipping penalty. When performing SW extension, BWA-MEM keeps track of the best score reaching the end of query. If this score is larger than the best SW score minus the clipping penalty, clipping will not be applied. Note that in this case, the SAM AS tag reports the best SW score; clipping penalty is not deducted. [5]
    -U INT    Penalty for an unpaired read pair. BWA-MEM scores an unpaired read pair as scoreRead1+scoreRead2-INT and scores a paired one as scoreRead1+scoreRead2-insertPenalty. It compares these two scores to determine whether we should force pairing. A larger value leads to more aggressive read pairing. [17]
    -p        Assume the first input query file is interleaved paired-end FASTA/Q. See the command description for details.
    -R STR    Complete read group header line. '\t' can be used in STR and will be converted to a TAB in the output SAM. The read group ID will be attached to every read in the output. An example is '@RG\tID:foo\tSM:bar'. [null]
    -T INT    Don't output alignment with score lower than INT. This option affects output and occasionally SAM flag 2. [30]
    -a        Output all found alignments for single-end or unpaired paired-end reads. These alignments will be flagged as secondary alignments.
    -C        Append FASTA/Q comment to SAM output. This option can be used to transfer read meta information (e.g. barcode) to the SAM output. Note that the FASTA/Q comment (the string after a space in the header line) must conform to the SAM spec (e.g. BC:Z:CGTAC). Malformatted comments lead to incorrect SAM output.
    -H        Use hard clipping 'H' in the SAM output. This option may dramatically reduce the redundancy of output when mapping long contig or BAC sequences.
    -M        Mark shorter split hits as secondary (for Picard compatibility).
    -v INT    Control the verbose level of the output. This option has not been fully supported throughout BWA. Ideally, a value 0 for disabling all the output to stderr; 1 for outputting errors only; 2 for warnings and errors; 3 for all normal messages; 4 or higher for debugging. When this option takes value 4, the output is not SAM. [3]

aln
    bwa aln [-n maxDiff] [-o maxGapO] [-e maxGapE] [-d nDelTail] [-i nIndelEnd] [-k maxSeedDiff] [-l seedLen] [-t nThrds] [-cRN] [-M misMsc] [-O gapOsc] [-E gapEsc] [-q trimQual] <in.db.fasta> <in.query.fq> > <out.sai>

    Find the SA coordinates of the input reads. Maximum maxSeedDiff differences are allowed in the first seedLen subsequence and maximum maxDiff differences are allowed in the whole sequence.

    OPTIONS:
    -n NUM    Maximum edit distance if the value is INT, or the fraction of missing alignments given 2% uniform base error rate if FLOAT. In the latter case, the maximum edit distance is automatically chosen for different read lengths. [0.04]
    -o INT    Maximum number of gap opens [1]
    -e INT    Maximum number of gap extensions, -1 for k-difference mode (disallowing long gaps) [-1]
    -d INT    Disallow a long deletion within INT bp towards the 3'-end [16]
    -i INT    Disallow an indel within INT bp towards the ends [5]
    -l INT    Take the first INT subsequence as seed. If INT is larger than the query sequence, seeding will be disabled. For long reads, this option is typically ranged from 25 to 35 for `-k 2'. [inf]
    -k INT    Maximum edit distance in the seed [2]
    -t INT    Number of threads (multi-threading mode) [1]
    -M INT    Mismatch penalty. BWA will not search for suboptimal hits with a score lower than (bestScore-misMsc). [3]
    -O INT    Gap open penalty [11]
    -E INT    Gap extension penalty [4]
    -R INT    Proceed with suboptimal alignments if there are no more than INT equally best hits. This option only affects paired-end mapping. Increasing this threshold helps to improve the pairing accuracy at the cost of speed, especially for short reads (~32bp).
    -c        Reverse query but not complement it, which is required for alignment in the color space. (Disabled since 0.6.x)
    -N        Disable iterative search. All hits with no more than maxDiff differences will be found. This mode is much slower than the default.
    -q INT    Parameter for read trimming. BWA trims a read down to argmax_x{\sum_{i=x+1}^l(INT-q_i)} if q_l<INT where l is the original read length. [0]
    -I        The input is in the Illumina 1.3+ read format (quality equals ASCII-64).
    -B INT    Length of barcode starting from the 5'-end. When INT is positive, the barcode of each read will be trimmed before mapping and will be written at the BC SAM tag. For paired-end reads, the barcodes from both ends are concatenated. [0]
    -b        Specify that the input read sequence file is in the BAM format. For paired-end data, two ends in a pair must be grouped together and options -1 or -2 are usually applied to specify which end should be mapped. Typical command lines for mapping paired-end data in the BAM format are:

                  bwa aln ref.fa -b1 reads.bam > 1.sai
                  bwa aln ref.fa -b2 reads.bam > 2.sai
                  bwa sampe ref.fa 1.sai 2.sai reads.bam reads.bam > aln.sam

    -0        When -b is specified, only use single-end reads in mapping.
    -1        When -b is specified, only use the first read in a read pair in mapping (skip single-end reads and the second reads).
    -2        When -b is specified, only use the second read in a read pair in mapping.

samse
    bwa samse [-n maxOcc] <in.db.fasta> <in.sai> <in.fq> > <out.sam>

    Generate alignments in the SAM format given single-end reads. Repetitive hits will be randomly chosen.

    OPTIONS:
    -n INT    Maximum number of alignments to output in the XA tag for reads paired properly. If a read has more than INT hits, the XA tag will not be written. [3]
    -r STR    Specify the read group in a format like `@RG\tID:foo\tSM:bar'. [null]

sampe
    bwa sampe [-a maxInsSize] [-o maxOcc] [-n maxHitPaired] [-N maxHitDis] [-P] <in.db.fasta> <in1.sai> <in2.sai> <in1.fq> <in2.fq> > <out.sam>

    Generate alignments in the SAM format given paired-end reads. Repetitive read pairs will be placed randomly.

    OPTIONS:
    -a INT    Maximum insert size for a read pair to be considered being mapped properly. Since 0.4.5, this option is only used when there are not enough good alignments to infer the distribution of insert sizes. [500]
    -o INT    Maximum occurrences of a read for pairing. A read with more occurrences will be treated as a single-end read. Reducing this parameter helps faster pairing. [100000]
    -P        Load the entire FM-index into memory to reduce disk operations (base-space reads only). With this option, at least 1.25N bytes of memory are required, where N is the length of the genome.
    -n INT    Maximum number of alignments to output in the XA tag for reads paired properly. If a read has more than INT hits, the XA tag will not be written. [3]
    -N INT    Maximum number of alignments to output in the XA tag for disconcordant read pairs (excluding singletons). If a read has more than INT hits, the XA tag will not be written. [10]
    -r STR    Specify the read group in a format like `@RG\tID:foo\tSM:bar'. [null]

bwasw
    bwa bwasw [-a matchScore] [-b mmPen] [-q gapOpenPen] [-r gapExtPen] [-t nThreads] [-w bandWidth] [-T thres] [-s hspIntv] [-z zBest] [-N nHspRev] [-c thresCoef] <in.db.fasta> <in.fq> [mate.fq]

    Align query sequences in the in.fq file. When mate.fq is present, perform paired-end alignment. The paired-end mode only works for reads from Illumina short-insert libraries. In the paired-end mode, BWA-SW may still output split alignments but they are all marked as not properly paired; the mate positions will not be written if the mate has multiple local hits.

    OPTIONS:
    -a INT    Score of a match [1]
    -b INT    Mismatch penalty [3]
    -q INT    Gap open penalty [5]
    -r INT    Gap extension penalty. The penalty for a contiguous gap of size k is q+k*r. [2]
    -t INT    Number of threads in the multi-threading mode [1]
    -w INT    Band width in the banded alignment [33]
    -T INT    Minimum score threshold divided by a [37]
    -c FLOAT  Coefficient for threshold adjustment according to query length. Given an l-long query, the threshold for a hit to be retained is a*max{T,c*log(l)}. [5.5]
    -z INT    Z-best heuristics. Higher -z increases accuracy at the cost of speed. [1]
    -s INT    Maximum SA interval size for initiating a seed. Higher -s increases accuracy at the cost of speed. [3]
    -N INT    Minimum number of seeds supporting the resultant alignment to skip reverse alignment. [5]
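The read-trimming rule given for the aln option -q is compact enough to benefit from a worked form. The Python sketch below evaluates the formula literally: it picks the x that maximizes the sum of (INT - q_i) over i = x+1..l, and trims only when the last base quality is below INT. The function name is hypothetical, ties are resolved toward the longest read (an assumption), and BWA's own implementation may differ in details such as a minimum retained length.

```python
def trimmed_length(quals, threshold):
    """Return the read length after trimming, per the `aln -q` formula.

    quals     : Phred base qualities q_1..q_l, already decoded to integers
    threshold : the INT passed to -q
    Hypothetical helper; a literal reading of the manual's formula.
    """
    n = len(quals)
    # No trimming when -q is off, the read is empty, or q_l >= INT.
    if threshold <= 0 or n == 0 or quals[-1] >= threshold:
        return n

    best_x, best_sum, running = n, 0, 0
    # Walk x from l-1 down to 0. After adding quals[x] (which is q_{x+1}
    # in 1-based terms), running equals sum_{i=x+1}^{l} (threshold - q_i).
    for x in range(n - 1, -1, -1):
        running += threshold - quals[x]
        if running > best_sum:  # strict: ties keep the longer read
            best_x, best_sum = x, running
    return best_x


if __name__ == "__main__":
    # Three good bases followed by a low-quality tail, with -q 20.
    print(trimmed_length([30, 30, 30, 10, 5, 3], 20))
```

With the qualities in the usage line, the partial sums for x = 5, 4, 3, 2 are 17, 32, 42 and 32, so the maximum is reached at x = 3 and the three low-quality tail bases are dropped.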
SAM ALIGNMENT FORMAT
The output of the `aln' command is binary and designed for BWA use only. BWA outputs the final alignment in the SAM (Sequence Alignment/Map) format. Each line consists of:

┌────┬───────┬──────────────────────────────────────────────────────────┐
│Col │ Field │ Description                                              │
├────┼───────┼──────────────────────────────────────────────────────────┤
│ 1  │ QNAME │ Query (pair) NAME                                        │
│ 2  │ FLAG  │ bitwise FLAG                                             │
│ 3  │ RNAME │ Reference sequence NAME                                  │
│ 4  │ POS   │ 1-based leftmost POSition/coordinate of clipped sequence │
│ 5  │ MAPQ  │ MAPping Quality (Phred-scaled)                           │
│ 6  │ CIGAR │ extended CIGAR string                                    │
│ 7  │ MRNM  │ Mate Reference sequence NaMe (`=' if same as RNAME)      │
│ 8  │ MPOS  │ 1-based Mate POSition                                    │
│ 9  │ ISIZE │ Inferred insert SIZE                                     │
│10  │ SEQ   │ query SEQuence on the same strand as the reference       │
│11  │ QUAL  │ query QUALity (ASCII-33 gives the Phred base quality)    │
│12  │ OPT   │ variable OPTional fields in the format TAG:VTYPE:VALUE   │
└────┴───────┴──────────────────────────────────────────────────────────┘

Each bit in the FLAG field is defined as:

┌────┬────────┬───────────────────────────────────────┐
│Chr │  Flag  │ Description                           │
├────┼────────┼───────────────────────────────────────┤
│ p  │ 0x0001 │ the read is paired in sequencing      │
│ P  │ 0x0002 │ the read is mapped in a proper pair   │
│ u  │ 0x0004 │ the query sequence itself is unmapped │
│ U  │ 0x0008 │ the mate is unmapped                  │
│ r  │ 0x0010 │ strand of the query (1 for reverse)   │
│ R  │ 0x0020 │ strand of the mate                    │
│ 1  │ 0x0040 │ the read is the first read in a pair  │
│ 2  │ 0x0080 │ the read is the second read in a pair │
│ s  │ 0x0100 │ the alignment is not primary          │
│ f  │ 0x0200 │ QC failure                            │
│ d  │ 0x0400 │ optical or PCR duplicate              │
└────┴────────┴───────────────────────────────────────┘

Please check <http://samtools.sourceforge.net> for the format specification and the tools for post-processing the alignment.

BWA generates the following optional fields. Tags starting with `X' are specific to BWA.

┌────┬───────────────────────────────────────────────────────┐
│Tag │ Meaning                                               │
├────┼───────────────────────────────────────────────────────┤
│NM  │ Edit distance                                         │
│MD  │ Mismatching positions/bases                           │
│AS  │ Alignment score                                       │
│BC  │ Barcode sequence                                      │
├────┼───────────────────────────────────────────────────────┤
│X0  │ Number of best hits                                   │
│X1  │ Number of suboptimal hits found by BWA                │
│XN  │ Number of ambiguous bases in the reference            │
│XM  │ Number of mismatches in the alignment                 │
│XO  │ Number of gap opens                                   │
│XG  │ Number of gap extensions                              │
│XT  │ Type: Unique/Repeat/N/Mate-sw                         │
│XA  │ Alternative hits; format: /(chr,pos,CIGAR,NM;)*/      │
├────┼───────────────────────────────────────────────────────┤
│XS  │ Suboptimal alignment score                            │
│XF  │ Support from forward/reverse alignment                │
│XE  │ Number of supporting seeds                            │
├────┼───────────────────────────────────────────────────────┤
│XP  │ Alt primary hits; format: /(chr,pos,CIGAR,mapQ,NM;)+/ │
└────┴───────────────────────────────────────────────────────┘

Note that XO and XG are generated by BWT search while the CIGAR string is generated by Smith-Waterman alignment. These two tags may be inconsistent with the CIGAR string. This is not a bug.
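The FLAG bit table above can be turned into a small decoder. The Python sketch below maps an integer FLAG value to the one-character codes and descriptions listed in that table; the names FLAG_BITS, decode_flag and flag_string are illustrative and are not part of BWA or samtools.

```python
# (character code, bit mask, description), in the order of the FLAG table.
FLAG_BITS = [
    ("p", 0x0001, "the read is paired in sequencing"),
    ("P", 0x0002, "the read is mapped in a proper pair"),
    ("u", 0x0004, "the query sequence itself is unmapped"),
    ("U", 0x0008, "the mate is unmapped"),
    ("r", 0x0010, "strand of the query (1 for reverse)"),
    ("R", 0x0020, "strand of the mate"),
    ("1", 0x0040, "the read is the first read in a pair"),
    ("2", 0x0080, "the read is the second read in a pair"),
    ("s", 0x0100, "the alignment is not primary"),
    ("f", 0x0200, "QC failure"),
    ("d", 0x0400, "optical or PCR duplicate"),
]


def decode_flag(flag):
    """Return the (code, description) pairs whose bit is set in flag."""
    return [(code, desc) for code, bit, desc in FLAG_BITS if flag & bit]


def flag_string(flag):
    """Return the set bits as a compact string of one-character codes."""
    return "".join(code for code, _ in decode_flag(flag))


if __name__ == "__main__":
    # 99 = 0x1 + 0x2 + 0x20 + 0x40
    for code, desc in decode_flag(99):
        print(code, desc)
```

For example, a FLAG of 99 sets bits 0x1, 0x2, 0x20 and 0x40: a properly paired first read whose mate is on the reverse strand.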
NOTES ON SHORT-READ ALIGNMENT
Alignment Accuracy

When seeding is disabled, BWA guarantees to find an alignment containing at most maxDiff differences including maxGapO gap opens which do not occur within nIndelEnd bp towards either end of the query. Longer gaps may be found if maxGapE is positive, but it is not guaranteed to find all hits. When seeding is enabled, BWA further requires that the first seedLen subsequence contains no more than maxSeedDiff differences.

When gapped alignment is disabled, BWA is expected to generate the same alignment as Eland version 1, the Illumina alignment program. However, as BWA changes `N' in the database sequence to random nucleotides, hits to these random sequences will also be counted. As a consequence, BWA may mark a unique hit as a repeat, if the random sequences happen to be identical to the sequences which should be unique in the database.

By default, if the best hit is not highly repetitive (controlled by -R), BWA also finds all hits containing one more mismatch; otherwise, BWA finds all equally best hits only. Base quality is NOT considered in evaluating hits. In the paired-end mode, BWA pairs all hits it found. It further performs Smith-Waterman alignment for unmapped reads to rescue reads with a high error rate, and for high-quality anomalous pairs to fix potential alignment errors.

Estimating Insert Size Distribution

BWA estimates the insert size distribution per 256*1024 read pairs. It first collects pairs of reads with both ends mapped with a single-end quality 20 or higher and then calculates median (Q2), lower and higher quartile (Q1 and Q3). It estimates the mean and the variance of the insert size distribution from pairs whose insert sizes are within interval [Q1-2(Q3-Q1), Q3+2(Q3-Q1)]. The maximum distance x for a pair considered to be properly paired (SAM flag 0x2) is calculated by solving equation Phi((x-mu)/sigma)=x/L*p0, where mu is the mean, sigma is the standard deviation of the insert size distribution, L is the length of the genome, p0 is the prior of an anomalous pair and Phi() is the standard cumulative distribution function. For mapping Illumina short-insert reads to the human genome, x is about 6-7 sigma away from the mean. Quartiles, mean, variance and x will be printed to the standard error output.

Memory Requirement

With the bwtsw algorithm, 5GB memory is required for indexing the complete human genome sequences. For short reads, the aln command uses ~3.2GB memory and the sampe command uses ~5.4GB.

Speed

Indexing the human genome sequences takes 3 hours with the bwtsw algorithm. Indexing smaller genomes with the IS algorithm is faster, but requires more memory.

The speed of alignment is largely determined by the error rate of the query sequences (r). Firstly, BWA runs much faster for near perfect hits than for hits with many differences, and it stops searching for a hit with l+2 differences if an l-difference hit is found. This means BWA will be very slow if r is high because in this case BWA has to visit hits with many differences and looking for these hits is expensive. Secondly, the alignment algorithm behind makes the speed sensitive to [k log(N)/m], where k is the maximum allowed differences, N the size of database and m the length of a query. In practice, we choose k w.r.t. r and therefore r is the leading factor. I would not recommend using BWA on data with r>0.02.

Pairing is slower for shorter reads. This is mainly because shorter reads have more spurious hits and converting SA coordinates to chromosomal coordinates is very costly.
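The quartile-based estimate described under "Estimating Insert Size Distribution" can be illustrated with a short Python sketch. It assumes the caller has already selected pairs with both ends mapped at single-end quality 20 or higher, uses the default quartile convention of Python's statistics.quantiles, and reports the sample variance. These choices, and the function name, are assumptions for illustration; BWA's own quartile and variance computation may differ. Solving for the maximum proper-pair distance x is not attempted here.

```python
import statistics


def estimate_insert_size(sizes):
    """Estimate mean and variance of insert sizes with quartile filtering.

    sizes: insert sizes of confidently mapped pairs (at least two values).
    Hypothetical helper following the manual's description, not BWA's code.
    """
    q1, q2, q3 = statistics.quantiles(sizes, n=4)
    iqr = q3 - q1
    # Keep only pairs inside [Q1 - 2(Q3-Q1), Q3 + 2(Q3-Q1)].
    low, high = q1 - 2 * iqr, q3 + 2 * iqr
    kept = [s for s in sizes if low <= s <= high]
    return {
        "q1": q1,
        "q2": q2,
        "q3": q3,
        "low": low,
        "high": high,
        "n_kept": len(kept),
        "mean": statistics.mean(kept),
        "variance": statistics.variance(kept),  # sample variance (assumption)
    }


if __name__ == "__main__":
    # 21 well-behaved pairs plus one chimeric-looking outlier.
    est = estimate_insert_size(list(range(190, 211)) + [5000])
    print(est["n_kept"], est["mean"], est["variance"])
```

In the usage line the outlier at 5000 falls far outside the interval and is excluded, so the mean and variance are computed from the 21 remaining pairs only.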
CHANGES IN BWA-0.6
Since version 0.6, BWA has been able to work with a reference genome longer than 4GB. This feature makes it possible to integrate the forward and reverse complemented genome in one FM-index, which speeds up both BWA-short and BWA-SW. As a tradeoff, BWA uses more memory because it has to keep all positions and ranks in 64-bit integers, twice as large as the 32-bit integers used in the previous versions.

The latest BWA-SW also works for paired-end reads longer than 100bp. In comparison to BWA-short, BWA-SW tends to be more accurate for highly unique reads and more robust to relatively long INDELs and structural variants. Nonetheless, BWA-short usually has higher power to distinguish the optimal hit from many suboptimal hits. The choice of the mapping algorithm may depend on the application.
SEE ALSO
BWA website <http://bio-bwa.sourceforge.net>, Samtools website <http://samtools.sourceforge.net>
AUTHOR
Heng Li at the Sanger Institute wrote the key source codes and integrated the following codes for BWT construction: bwtsw <http://i.cs.hku.hk/~ckwong3/bwtsw/>, implemented by Chi-Kwong Wong at the University of Hong Kong and IS <http://yuta.256.googlepages.com/sais> originally proposed by Nong Ge <http://www.cs.sysu.edu.cn/nong/> at the Sun Yat-Sen University and implemented by Yuta Mori.
LICENSE AND CITATION
The full BWA package is distributed under GPLv3 as it uses source codes from BWT-SW which is covered by GPL. Sorting, hash table, BWT and IS libraries are distributed under the MIT license.

If you use the BWA-backtrack algorithm, please cite the following paper:

    Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754-1760. [PMID: 19451168]

If you use the BWA-SW algorithm, please cite:

    Li H. and Durbin R. (2010) Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26, 589-595. [PMID: 20080505]

If you use BWA-MEM or the fastmap component of BWA, please cite:

    Li H. (2013) Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997v1 [q-bio.GN].

It is likely that the BWA-MEM manuscript will not appear in a peer-reviewed journal.
HISTORY
BWA is largely influenced by BWT-SW. It uses source codes from BWT-SW and mimics its binary file formats; BWA-SW resembles BWT-SW in several ways. The initial idea about BWT-based alignment also came from the group who developed BWT-SW. At the same time, BWA is different enough from BWT-SW. The short-read alignment algorithm bears no similarity to the Smith-Waterman algorithm any more. While BWA-SW learns from BWT-SW, it introduces heuristics that can hardly be applied to the original algorithm. In all, BWA does not guarantee to find all local hits as BWT-SW is designed to do, but it is much faster than BWT-SW on both short and long query sequences.

I started to write the first piece of code on 24 May 2008 and got the initial stable version on 02 June 2008. During this period, I learned that Professor Tak-Wah Lam, the first author of the BWT-SW paper, was collaborating with Beijing Genomics Institute on SOAP2, the successor to SOAP (Short Oligonucleotide Analysis Package). SOAP2 came out in November 2008. According to the SourceForge download page, the third BWT-based short read aligner, bowtie, was first released in August 2008. At the time of writing this manual, at least three more BWT-based short-read aligners are being implemented.

The BWA-SW algorithm is a new component of BWA. It was conceived in November 2008 and implemented ten months later.

The BWA-MEM algorithm is based on an algorithm finding super-maximal exact matches (SMEMs), which was first published with the fermi assembler paper in 2012. I first implemented the basic SMEM algorithm in the fastmap command for an experiment and then extended the basic algorithm and added the extension part in February 2013 to make BWA-MEM a fully featured mapper.
Source: http://manpages.ubuntu.com/manpages/trusty/man1/bwa.1.html