BEDTools Version 2.13.0
Owing to travel and time spent trying to get my new lab up and running, it's been a few months since the last BEDTools release (v2.12.0). Today I released version 2.13.0, which includes three new tools, as well as several useful new options and bug fixes. I am grateful to all the people who have helped with this release. I've tried to highlight their efforts below, but I am sure I've missed something. My apologies to those whose efforts I've missed. Please feel free to email me or the bedtools mailing list with suggestions or comments.
New tools
multiBamCov. This new tool counts sequence coverage for multiple position-sorted and indexed BAM files at specific loci defined in a BED/GFF/VCF file. In the example below, the last 3 columns represent the number of alignments overlapping each interval from the three BAM files. multiBamCov also works with a single BAM file. Because each interval in the BED/GFF/VCF file is explored and reported in order, it can serve as an alternative to coverageBed, which reports its output in a different order than the B file.
$ multiBamCov -bams aln.1.bam aln.2.bam aln.3.bam -bed exons.bed
chr1 861306 861409 SAMD11 1 + 181 280 236
chr1 865533 865718 SAMD11 2 + 249 365 374
chr1 866393 866496 SAMD11 3 + 162 298 322
The following options are available to control which types of alignments are counted. Many thanks to Chip Stewart for the addition of the -D and -F options.
-q Minimum mapping quality allowed. Default is 0.
-D Include duplicate-marked reads. Default is to count non-duplicates only.
-F Include failed-QC reads. Default is to count pass-QC reads only.
-p Only count proper pairs. Default is to count all alignments with MAPQ greater than the -q argument, regardless of the BAM FLAG field.
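As a rough illustration of how these options combine, here is a minimal Python sketch. It is not multiBamCov's source code: `is_counted` is a hypothetical helper, the flag bits come from the SAM specification, and treating -q as an inclusive minimum ("minimum mapping quality allowed") is an assumption.

```python
# SAM FLAG bits, as defined in the SAM specification.
FLAG_PROPER_PAIR = 0x2
FLAG_QC_FAIL = 0x200
FLAG_DUPLICATE = 0x400


def is_counted(flag, mapq, min_mapq=0, include_dups=False,
               include_failed_qc=False, proper_pairs_only=False):
    """Return True if an alignment would be counted under the given options.

    Hypothetical helper that mirrors the documented meaning of
    -q, -D, -F and -p. Assumption: -q is an inclusive minimum.
    """
    if mapq < min_mapq:                                    # -q
        return False
    if flag & FLAG_DUPLICATE and not include_dups:         # -D
        return False
    if flag & FLAG_QC_FAIL and not include_failed_qc:      # -F
        return False
    if proper_pairs_only and not flag & FLAG_PROPER_PAIR:  # -p
        return False
    return True
```

With the defaults, a duplicate-marked read (`flag=0x400`) is skipped, while `include_dups=True` (the -D behaviour) lets it through.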
tagBam. This tool annotates a BAM file with custom tag fields based on overlaps with BED/GFF/VCF files. The default tag type is "YB", but with the -tag option, one can specify custom tag types. In the example below, for alignments that have overlaps, you should see new BAM tags like "YB:Z:exonic", "YB:Z:cpg;utr":
$ tagBam -i aln.bam -files exons.bed introns.bed cpg.bed utrs.bed -labels exonic intronic cpg utr > aln.tagged.bam
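The tag values shown above ("YB:Z:exonic", "YB:Z:cpg;utr") suggest that the labels of all overlapping files are joined with semicolons. The Python sketch below illustrates that idea only; `build_tag` is a hypothetical helper, and the overlap test itself is assumed to have been done elsewhere.

```python
def build_tag(overlaps, labels, tag="YB"):
    """Return a SAM-style tag string for one alignment, or None if nothing overlaps.

    overlaps: one boolean per annotation file (True if the alignment overlaps it).
    labels:   one label per annotation file, in the same order.
    """
    hits = [label for hit, label in zip(overlaps, labels) if hit]
    if not hits:
        return None
    return f"{tag}:Z:{';'.join(hits)}"


labels = ["exonic", "intronic", "cpg", "utr"]
print(build_tag([True, False, False, False], labels))  # YB:Z:exonic
print(build_tag([False, False, True, True], labels))   # YB:Z:cpg;utr
```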
nucBed. This new tool profiles the nucleotide content of intervals in a FASTA file. Thanks to Can Alkan for suggesting a header line. The following information will be reported after each original BED/GFF/VCF entry:
1) %AT content
2) %GC content
3) Number of As observed
4) Number of Cs observed
5) Number of Gs observed
6) Number of Ts observed
7) Number of Ns observed
8) Number of other bases observed
9) The length of the explored sequence/interval.
10) The sequence extracted from the FASTA file. (optional, if -seq is used)
11) The number of times a user defined pattern was observed. (optional, if -pattern is used.)
For example:
$ nucBed -fi ~/data/genomes/hg18/hg18.fa -bed simrep.bed | head -3
#1_usercol 2_usercol 3_usercol 4_usercol 5_usercol 6_usercol 7_pct_at 8_pct_gc 9_num_A 10_num_C 11_num_G 12_num_T 13_num_N 14_num_oth 15_seq_len
chr1 10000 10468 trf 789 + 0.540598 0.459402 155 96 119 98 0 0 468
chr1 10627 10800 trf 346 + 0.445087 0.554913 54 55 41 23 0 0 173
One can also report the sequence itself:
$ nucBed -fi ~/data/genomes/hg18/hg18.fa -bed simrep.bed -seq | head -3
#1_usercol 2_usercol 3_usercol 4_usercol 5_usercol 6_usercol 7_pct_at 8_pct_gc 9_num_A 10_num_C 11_num_G 12_num_T 13_num_N 14_num_oth 15_seq_len 16_seq
chr1 10000 10468 trf 789 + 0.540598 0.459402 155 96 119 98 0 0 468 ccagggg...
chr1 10627 10800 trf 346 + 0.445087 0.554913 54 55 41 23 0 0 173 TCTTTCA...
Or, one can count the number of times that a specific pattern occurs in the intervals (reported as the last column):
$ nucBed -fi ~/data/genomes/hg18/hg18.fa -bed simrep.bed -pattern CGTT | head
#1_usercol 2_usercol 3_usercol 4_usercol 5_usercol 6_usercol 7_pct_at 8_pct_gc 9_num_A 10_num_C 11_num_G 12_num_T 13_num_N 14_num_oth 15_seq_len 16_user_patt_count
chr1 10000 10468 trf 789 + 0.540598 0.459402 155 96 119 98 0 0 468 0
chr1 10627 10800 trf 346 + 0.445087 0.554913 54 55 41 23 0 0 173 0
chr1 10757 10997 trf 434 + 0.370833 0.629167 49 70 81 40 0 0 240 0
chr1 11225 11447 trf 273 + 0.463964 0.536036 44 86 33 59 0 0 222 0
chr1 11271 11448 trf 187 + 0.463277 0.536723 37 69 26 45 0 0 177 0
chr1 11283 11448 trf 199 + 0.466667 0.533333 37 64 24 40 0 0 165 0
chr1 19305 19443 trf 242 + 0.282609 0.717391 17 57 42 22 0 0 138 1
chr1 20828 20863 trf 70 + 0.428571 0.571429 10 7 13 5 0 0 35 0
chr1 30862 30959 trf 79 + 0.556701 0.443299 35 22 21 19 0 0 97 0
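The reported columns can be reproduced from the base counts: in the first row above, 155 As plus 98 Ts over 468 bases gives 0.540598. The Python sketch below shows one way to compute the same columns for a single sequence. It is an illustration, not nucBed's implementation; `profile` is a hypothetical helper, and dividing by the full sequence length and counting the pattern case-insensitively without overlapping matches are both assumptions.

```python
def profile(seq, pattern=None):
    """Compute nucBed-style content columns for one extracted sequence."""
    upper = seq.upper()  # lowercase (soft-masked) bases count like uppercase
    length = len(upper)
    counts = {base: upper.count(base) for base in "ACGTN"}
    row = {
        # Assumption: percentages are fractions of the full sequence length.
        "pct_at": (counts["A"] + counts["T"]) / length if length else 0.0,
        "pct_gc": (counts["G"] + counts["C"]) / length if length else 0.0,
        "num_A": counts["A"],
        "num_C": counts["C"],
        "num_G": counts["G"],
        "num_T": counts["T"],
        "num_N": counts["N"],
        "num_oth": length - sum(counts.values()),
        "seq_len": length,
    }
    if pattern is not None:
        # Assumption: case-insensitive, non-overlapping matches.
        row["user_patt_count"] = upper.count(pattern.upper())
    return row


# A synthetic sequence with the same base counts as the first example row.
first = profile("a" * 155 + "c" * 96 + "g" * 119 + "t" * 98)
print(round(first["pct_at"], 6), round(first["pct_gc"], 6))  # 0.540598 0.459402
```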
New options
Support for "named pipes" and FIFOs. My sincere thanks to Davide Cittaro, Michael Hoffman and Nate Weeks for the help in working this out. This allows things like:
$ intersectBed -a <(head a.gff) -b <(head b.gff)

### OR ###

$ mkfifo tmp_pipe
$ awk '$7 == 0' ID137308R292.cov > tmp_pipe &
$ intersectBed -a <(awk '$7 == 0' ID137308.cov) -b tmp_pipe
All BEDTools now allow the use of "-" to indicate that data is being sent via stdin. In order to allow backwards compatibility, "stdin" is also allowed.
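For readers writing their own scripts around BEDTools, the same convention is easy to mimic. The Python sketch below treats both "-" and "stdin" as standard input and anything else as a file path; `open_input` is a hypothetical helper and says nothing about how BEDTools implements this internally.

```python
import sys
from contextlib import contextmanager


@contextmanager
def open_input(path):
    """Yield a readable text handle: stdin for "-" or "stdin", else the named file."""
    if path in ("-", "stdin"):
        # Do not close stdin; the caller does not own it.
        yield sys.stdin
    else:
        with open(path) as handle:
            yield handle
```

A script would then use `with open_input(sys.argv[1]) as fh:` and read lines from `fh` without caring where they come from.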
Multiple tools. Added new -S (that is, opposite strands) option to annotateBed, closestBed, coverageBed, intersectBed, pairToBed, subtractBed, and windowBed (-Sm). This new option does the opposite of the -s option: that is, overlaps are only processed if they are on opposite strands. Thanks to Sol Katzman for the great suggestion. Very useful for certain RNA-seq analyses.
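The difference between -s and -S comes down to one extra comparison after the overlap test. Here is a minimal Python sketch of that logic; `Feature` and `keep_overlap` are hypothetical names, intervals are assumed to be half-open BED coordinates, and both features are assumed to carry a "+" or "-" strand (how the tools treat unknown strands is not covered here).

```python
from collections import namedtuple

Feature = namedtuple("Feature", "chrom start end strand")


def overlaps(a, b):
    """Half-open intervals overlap if they share at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def keep_overlap(a, b, mode=None):
    """mode: None ignores strand, "s" requires the same strand, "S" opposite strands."""
    if not overlaps(a, b):
        return False
    if mode == "s":
        return a.strand == b.strand
    if mode == "S":
        return a.strand != b.strand
    return True


read = Feature("chr1", 10, 20, "+")
antisense_gene = Feature("chr1", 15, 30, "-")
print(keep_overlap(read, antisense_gene, mode="s"))  # False
print(keep_overlap(read, antisense_gene, mode="S"))  # True
```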
coverageBed. Added a new -counts option to coverageBed that only reports the count of overlaps, instead of also computing fractions, etc. This is much faster and uses much less memory.
genomeCoverageBed. Added new -scale option that allows the coverage values to be scaled by a constant. Useful for normalizing coverage with RPM, RPKM, etc. Thanks to Ryan Dale for the useful suggestion. Added new -5, -3, -trackline, -trackopts, and -dz options. Many thanks to Assaf Gordon for these improvements.
-5: Calculate coverage of 5' ends (instead of entire interval).
-3: Calculate coverage of 3' ends (instead of entire interval).
-trackline: Adds a UCSC/Genome-Browser track line definition in the first line of the output.
-trackopts: Writes additional track line definition parameters in the first line.
-dz: Report the depth at each genome position with zero-based coordinates, instead of one-based.
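To make the -scale use case concrete: for reads-per-million (RPM) normalization, the constant is one million divided by the number of mapped reads. The Python sketch below works through that arithmetic; the read total and depth values are made up for illustration, and both function names are hypothetical.

```python
def rpm_scale_factor(total_mapped_reads):
    """Constant that converts raw depth to reads-per-million."""
    return 1_000_000 / total_mapped_reads


def scale_depths(depths, factor):
    """Multiply each raw depth by the constant, as a scaling option would."""
    return [depth * factor for depth in depths]


factor = rpm_scale_factor(4_000_000)  # hypothetical library of 4 million mapped reads
print(factor)                              # 0.25
print(scale_depths([0, 40, 100], factor))  # [0.0, 10.0, 25.0]
```

The value of `factor` is the number one would pass to -scale.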
closestBed. See below, thanks to Brent Pedersen, Assaf Gordon, Ryan Layer and Dan Webster for the helpful discussions.
closestBed now reports _all_ features in B that overlap A by default. This allows folks to decide which is the "best" overlapping feature on their own.
closestBed now has a "-io" option that ignores overlapping features. In other words, it will only report the closest, non-overlapping feature.
An example:
$ cat a.bed
chr1 10 20

$ cat b.bed
chr1 15 16
chr1 16 40
chr1 100 1000
chr1 200 1000

$ bin/closestBed -a a.bed -b b.bed
chr1 10 20 chr1 15 16
chr1 10 20 chr1 16 40

$ closestBed -a a.bed -b b.bed -io
chr1 10 20 chr1 100 1000
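The behaviour in this example can be sketched in a few lines of Python. This is an illustration of the reporting rules only, not closestBed's algorithm: `distance` and `closest` are hypothetical helpers, all features are assumed to be on the same chromosome, and the distance convention (0 for overlaps, gap plus one otherwise) is an assumption.

```python
def distance(a, b):
    """Distance between half-open (start, end) intervals; 0 if they overlap."""
    if a[0] < b[1] and b[0] < a[1]:
        return 0
    if b[0] >= a[1]:
        return b[0] - a[1] + 1  # b lies downstream of a
    return a[0] - b[1] + 1      # b lies upstream of a


def closest(a, features, ignore_overlaps=False):
    """Return every feature tied for the smallest distance to a."""
    candidates = [b for b in features
                  if not (ignore_overlaps and distance(a, b) == 0)]
    if not candidates:
        return []
    best = min(distance(a, b) for b in candidates)
    return [b for b in candidates if distance(a, b) == best]


a = (10, 20)
b = [(15, 16), (16, 40), (100, 1000), (200, 1000)]
print(closest(a, b))                        # [(15, 16), (16, 40)]
print(closest(a, b, ignore_overlaps=True))  # [(100, 1000)]
```

Both overlapping features tie at distance 0, so both are reported by default; with overlaps ignored, the nearest remaining feature wins.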
Updates
Updated to the latest version of BamTools. This allows greater functionality and will facilitate new options and tools in the future.
Bug Fixes
GFF files cannot have zero-length features.
Corrected an erroneous check on the start coordinates in VCF files. Thanks to Jan Vogel for the correction.
mergeBed now always reports output in BED format.
Updated the text file Tokenizer() function to yield a 15% speed improvement.
Various tweaks and improvements.



















