genetic.code @geneticcode-blog - Tumblr Blog

What is in a (chromosome) name?

Have you ever been incensed by the ridiculous number of chromosome naming and ordering schemes that exist in genomics? If the answer is "no", then either you are an incredibly patient person, you enjoy unnecessary chaos, or you just haven't done any detailed analysis of genomics datasets. One of the main research areas in my lab is the development of well-tested, easy-to-use software for genomic research (I'll be the first to admit that we have room to improve...). The pursuit of the "easy to use" goal is complicated by many factors such as the size and complexity of the datasets we deal with. Bad algorithms that happily succeed with smallish datasets quickly fail miserably when we encounter, to quote Titus Brown, "datasets of abnormal size". But perhaps the most annoying complexity comes from the lack of a standard for naming and ordering chromosomes. For example, the largest human chromosome is variously named "chr1", "1", or "Chr1". Further still, as Deanna Church pointed out to me on Twitter, we could use the GenBank ID "CM000663.1" or the RefSeq ID "NC_000001.10" ([see NCBI for details](http://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.13/)).

@aaronquinlan better yet, take advantage of the robust data management you get from GenBank and use accession.versions: ncbi.nlm.nih.gov/assembly/GCF_0…

January 29, 2013

Myriad naming systems pose a more vexing problem for algorithm development. Specifically, one way for genomics tools to keep pace with data scale is to employ algorithms that exploit *pre-sorted* data. When data is pre-sorted, one can often avoid loading full datasets into memory and instead "sweep" through the data and do one's work "on the fly" (e.g., the "chromsweep" algorithm invoked by the `-sorted` option in bedtools). However, the crucial issue is that in most cases we have to test whether one chromosome comes before or after another; thus, we must know what the sorting order is. Because of the diverse naming schemes above, most tool developers store chromosome names internally as strings. That is, whether it is "chr1" or truly the number 1, it is stored as "chr1" or "1" because we cannot predict ahead of time what naming scheme you have chosen. Annoyed yet?

In almost every programming language that I know of, storing chromosome names as strings forces code that uses the default operators such as `==`, `<`, and `>` to compare chromosomes based on an assumed [lexicographical ordering](http://en.wikipedia.org/wiki/Lexicographical_order). That is, the following will be true (as expected):

    return "chr1" < "chr2"

yet so will this:

    return "chr10" < "chr2"

For example, here is a lexicographical ordering of the human autosomes, sex chromosomes, and the mitochondria:

    chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr2 chr20 chr21 chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrM chrX chrY

Yet we often want to store our data in a more sensible manner such as:

    chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY chrM

The issue is that because these are strings, the above order violates lexicographical ordering. Now, in this specific case, there are ways around the problem.
We can write our own custom sort comparators that ignore leading alphabetical characters and instead compare chromosome names based on the business end of the label - that is, the part that smells like a number. In fact, that is what Heng Li has done in this function in samtools:

    #include <ctype.h>   /* isdigit */
    #include <stdlib.h>  /* strtol */

    static inline int strnum_cmp(const char *a, const char *b)
    {
        char *pa, *pb;
        pa = (char*)a; pb = (char*)b;
        while (*pa && *pb) {
            if (isdigit(*pa) && isdigit(*pb)) {
                /* both labels have a number here: compare numerically */
                long ai, bi;
                ai = strtol(pa, &pa, 10);
                bi = strtol(pb, &pb, 10);
                if (ai != bi) return ai<bi? -1 : ai>bi? 1 : 0;
            } else {
                if (*pa != *pb) break;
                ++pa; ++pb;
            }
        }
        if (*pa == *pb)
            return (pa-a) < (pb-b)? -1 : (pa-a) > (pb-b)? 1 : 0;
        return *pa<*pb? -1 : *pa>*pb? 1 : 0;
    }

The problem I have with this approach is that it makes assumptions about the naming scheme. For the tools my lab develops and maintains, I am interested in **general** approaches that work no matter the naming scheme and the data ordering.

I was particularly motivated by a recent [issue](http://code.google.com/p/bedtools/issues/detail?id=146) that was posted by a user of our [bedtools software](http://code.google.com/p/bedtools/). Basically, the user was trying to intersect genomic intervals from two large files stored in BED format using the `-sorted` option, which invokes a memory-efficient algorithm that can detect intersecting intervals from two files, provided that they are (you guessed it) *lexicographically* sorted. Well, the OP's data were not. In fact they weren't sorted at all: they were "grouped" - that is, all the records from each chromosome were grouped together, yet the order in which the chromosomes appeared followed no specific sorting criteria. Consequently, the user had to maintain multiple versions of the same data: the original version, the version sorted according to the rules used by bedtools, the version sorted according to the rules for tool _X_, tool _Y_, and so forth. Maintaining multiple versions of the same data is wasteful.
Moreover, sorting takes time and ideally, we'd like to take the [Ronco](http://en.wikipedia.org/wiki/Ron_Popeil) approach (that is, **sort it and forget it**), since the cost of sorting once is amortized over the number of times you get to exploit the benefits of pre-sorted data in your analyses.

As a scientist, the lack of a clear standard pisses me off. As a developer, it inspires me to think of clever solutions to make my life, and that of others, easier. In an effort to provide a general approach to this issue, we have a nice prototype of a general algorithm for bedtools that will properly find intersections among multiple files no matter what sorting/grouping criteria one uses. There's just one catch: you can't mix criteria among the files. That would just be annoying. In a subsequent post, I will describe the approach and provide examples.

// arq
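To make the ordering problem from this post concrete, here is a minimal Python sketch. It contrasts plain string sorting with a "natural" key that compares digit runs as integers, and it shows one naming-agnostic idea: ranking chromosomes by the order in which they first appear in a grouped stream. The helper names `natural_key` and `order_of_appearance` are hypothetical, and this is only an illustration, not the bedtools prototype mentioned above.

```python
import re


def natural_key(name):
    """Split a label into digit and non-digit chunks; digit chunks compare as integers."""
    chunks = re.findall(r"[0-9]+|[^0-9]+", name)
    # Tuples keep numbers and text mutually comparable: numeric chunks sort before text.
    return [(0, int(c), "") if "0" <= c[0] <= "9" else (1, 0, c) for c in chunks]


def order_of_appearance(chroms):
    """Rank each chromosome by where it first appears in a grouped stream (hypothetical helper)."""
    rank = {}
    for chrom in chroms:
        rank.setdefault(chrom, len(rank))
    return rank


names = ["chr10", "chr2", "chrX", "chr1", "chrM"]
lexicographic = sorted(names)            # plain string comparison
natural = sorted(names, key=natural_key)  # numeric part compared as a number

# A "grouped" file: records for each chromosome are contiguous, in arbitrary order.
grouped_stream = ["chr2", "chr2", "chr10", "chrX", "chrX", "chr1"]
rank = order_of_appearance(grouped_stream)
chr10_before_chrX = rank["chr10"] < rank["chrX"]

print(lexicographic)
print(natural)
print(chr10_before_chrX)
```

Note that the natural key still places chrM before chrX and chrY, so even this comparator does not reproduce the "sensible" order listed earlier without extra special-casing. The order-of-appearance ranking makes no assumption about names at all, but it only works when every file being compared uses the same grouping order.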

Deriving kinship coefficients from a VCF file.

(reposted from a post on my old blog which died in a departmental website reorganization)

I am currently investigating a rare phenotype among a cohort of ostensibly unrelated individuals. Together with a colleague here at the University of Virginia, I have performed deep exome sequencing of each individual in our cohort in an effort to identify new genetic variation that accounts for the phenotype. In so doing, we noticed (rather by chance) that two of our samples had strikingly similar genotypes at several loci that we were interested in. This led quickly to the suspicion of two possible scenarios. First, it might be that the two samples were closely related. Alternatively, it could be that the two sample DNAs were from the same individual; in other words, there was a laboratory mix-up that led to the same individual being sequenced twice.

Now, were the samples to have come from the same individual (or monozygotic twins), one would expect that well over 95% of the genotypes for the two samples would be identical (the remaining 5% being discordant owing to fluctuations in sequence coverage that would lead to missed heterozygote calls in one sample or the other). Yet if the samples are related, but their relatedness is more cryptic, we need a more powerful statistic to describe the level of relatedness between our samples. The kinship coefficient (f) is a measure of relatedness that represents the probability that two alleles, one sampled at random from each sample/individual, are identical by descent. For example, take a mother and child. If you sample an allele at a given site from the child, there is a 0.50 probability that that allele came from the mother. Given that the allele came from the mother, there is a 0.50 probability that the allele chosen at random in the mother is the same as the allele already chosen in the child. These are independent probabilities, so the kinship coefficient for a parent (P) and child (C) is:

f(PC) = 0.50 * 0.50 = 0.25

People are often confused by the fact that the kinship coefficient for a self-self comparison or a monozygotic twin comparison is 0.50. Yet recall that we are sampling alleles at random (with replacement) and testing for identity by descent. Thus, even for the same individual, we have a 0.50 probability of choosing the same parental allele twice.
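The sampling argument above can be checked with a small simulation. This is a minimal sketch under simplifying assumptions: a single locus, each founder allele given a unique label so that "same label" stands in for "identical by descent", and a child who happened to inherit `m1` and `f1`. The function name `kinship_by_simulation` is hypothetical.

```python
import random


def kinship_by_simulation(genotype_a, genotype_b, trials=200_000, seed=0):
    """Estimate the chance that one random allele from each genotype is the same founder allele."""
    rng = random.Random(seed)
    hits = sum(rng.choice(genotype_a) == rng.choice(genotype_b) for _ in range(trials))
    return hits / trials


# Unique labels mark distinct founder alleles (an assumption of this sketch).
mother = ("m1", "m2")
father = ("f1", "f2")
child = ("m1", "f1")  # one allele inherited from each parent

f_parent_child = kinship_by_simulation(mother, child)
f_self = kinship_by_simulation(child, child)
f_unrelated = kinship_by_simulation(mother, father)

print(round(f_parent_child, 3), round(f_self, 3), round(f_unrelated, 3))
```

The estimates should land near 0.25 for parent and child, 0.50 for the self comparison, and 0.00 for the two unrelated parents, in line with the reasoning in the text.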

The following table lists the kinship coefficient (and the coefficient of relatedness, r = 2*f) for several common cases.

| Relationship | Kinship coefficient | Coefficient of relatedness |
|---|---|---|
| Self | 0.5000 | 1.000 |
| Monozygotic twins | 0.5000 | 1.000 |
| Parent-child | 0.2500 | 0.500 |
| Full siblings | 0.2500 | 0.500 |
| Half siblings | 0.1250 | 0.250 |
| First cousins | 0.0625 | 0.1250 |
| Unrelated | 0.0000 | 0.0000 |

The kinship coefficient is extremely useful for identifying cryptic relationships among samples (e.g. population stratification, inbreeding, etc.) and as a potent means of quality control. For example, as I describe here, it can be used to identify identical samples, as well as unexpected relationships among ostensibly unrelated samples.

It turns out that my colleagues Ani Manichaikul and Wei-min Chen, both Assistant Professors in the Center for Public Health Genomics at The University of Virginia, recently published a new software package (KING) for rapidly computing the kinship coefficient (among many other useful statistics). However, KING does not directly support the VCF format as input. It turns out that one can get around this problem by using Adam Auton's excellent vcftools, PLINK, and then KING. Below are the necessary steps.

First use vcftools to convert the VCF file (call it example.vcf) to PLINK PED and MAP formats:

$ vcftools --vcf example.vcf --plink

By default, this will create two new files, out.map and out.ped. As KING will accept PLINK’s binary PED or BED (I know, it’s confusing given the UCSC BED format for defining intervals) format, we next use PLINK to convert the PED and MAP files to a single BED file:

$ plink --file out --make-bed

Using the genotypes present in the VCF (now binary PED or BED) file, we are now ready to use KING to estimate the kinship coefficient for our samples:

$ king -b plink.bed --kinship

By screening the output (I’ve truncated it for clarity and simplicity), we can see that KING reports, as the 8th column, the kinship coefficient between the samples in the VCF file. For example, whereas sample 7 (s7) has very little kinship with samples 11 and 13 (0.0013 and 0.0052, respectively), the kinship coefficient between samples 2 and 14 (0.4970) suggests that the two samples are from the same individual, or are monozygotic twins. In this case, the same DNA sample was mistakenly sequenced twice.

    $ cat king.kin0
    FID1 ID1 FID2 ID2 N_SNP HetHet IBS0 Kinship
    s7 s7 s11 s11 245941 0.090 0.0694 0.0013
    s7 s7 s13 s13 247179 0.088 0.0713 0.0052
    ...
    s2 s2 s14 s14 251994 0.234 0.0071 0.4970
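Screening the Kinship column by eye does not scale to many samples, so a short script can help. The sketch below parses the truncated rows shown above (embedded as a string so the example is self-contained) and labels each pair using the inference cutoffs suggested in the KING documentation (above 0.354 for duplicates or MZ twins, then 0.177, 0.0884 and 0.0442 for first-, second- and third-degree relatives). The function names are hypothetical, and a real `king.kin0` file may contain additional columns.

```python
# Rows copied from the truncated output shown in the post.
KIN0_TEXT = """\
FID1 ID1 FID2 ID2 N_SNP HetHet IBS0 Kinship
s7 s7 s11 s11 245941 0.090 0.0694 0.0013
s7 s7 s13 s13 247179 0.088 0.0713 0.0052
s2 s2 s14 s14 251994 0.234 0.0071 0.4970
"""


def classify(kinship):
    """Map a kinship estimate to a relationship label using KING's suggested cutoffs."""
    if kinship > 0.354:
        return "duplicate or MZ twin"
    if kinship > 0.177:
        return "1st-degree"
    if kinship > 0.0884:
        return "2nd-degree"
    if kinship > 0.0442:
        return "3rd-degree"
    return "unrelated"


def screen_pairs(text):
    """Return (ID1, ID2, label) for every data row of a kin0-style table."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    results = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))  # look columns up by name, not position
        results.append((row["ID1"], row["ID2"], classify(float(row["Kinship"]))))
    return results


pairs = screen_pairs(KIN0_TEXT)
for id1, id2, label in pairs:
    print(id1, id2, label)
```

Looking columns up by header name rather than by position keeps the script working if the column order differs between KING versions.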

Opening for a genomics software engineer

My laboratory is seeking an experienced, creative, and highly motivated programmer / software engineer to fill a Scientific Programmer position in our computational genomics research group at the University of Virginia. If you are a talented programmer that is excited by challenging problems, developing innovative algorithms, and exploring complex datasets in hopes of understanding the genetic basis of disease, this is the job for you.

About the Quinlan Lab: Modern human genetics is very much a computational science. Our research group in UVA's Center for Public Health Genomics is focused on the development of scalable computational methods that will enable the large-scale genomics studies of the future. The computational demands of modern genetics research require innovative techniques and new algorithms that exploit parallel computing architectures. We are currently developing and applying our methods to several projects including: genomic data mining, genome architecture and mutational dynamics in brain cancer, the landscape of structural variation in human genomes, the genetics of Type 1 diabetes, and the functional implications of genetic variation in targeted sequencing applications. We seek an enthusiastic and experienced programmer, preferably with modern genomics research experience, to work with our team of scientists to develop new computational techniques and software for genomics research. The candidate will take a leading role in the design and implementation of these techniques and will have creative freedom to explore ideas of their own. Collaborative participation is also expected in the development of publications for peer-reviewed journals, as well as on grant proposals to funding agencies.

Required Qualifications:

Master's degree in Computer Science, Engineering, or a relevant discipline.

Substantial and proven experience in the design and implementation of software.

Experience and a deep understanding of the C, C++, or Java programming languages; C and C++ preferred.

Strong organizational and time management skills.

Independence, creativity and good communication skills.

Preferred Qualifications:

Experience in Bioinformatics or Genetics.

5 to 7 years of demonstrated programming and software design experience.

Experience with developing software exploiting parallel computing frameworks. Example frameworks include: OpenMP, MPI, Hadoop, CUDA, PThreads.

Experience writing publications from personal research with a record of publication in peer-reviewed journals.

Experience in genomics, if doctorate is in another field.

Interested applicants should email their CV and a cover letter to Dr. Aaron Quinlan (arq5x at virginia dot edu). http://cphg.virginia.edu/quinlan/

Measuring genetic distance with Python and NumPy

It is fascinating to think about the degree and origin of genetic differences among individuals. The whats and whys of our genetic differences are the consequence of complex human demography: we humans migrated in multiple waves out of Africa and the subsequent development of communal civilizations led to non-random mating and the greater tolerance of otherwise more deleterious mutations (that is, the comforts of home and communal societies allowed genetic mutations that would otherwise be selected against to "get by"). As a result, individuals within sub-populations (e.g., European Americans or Central Americans) are much more genetically similar to one another than individuals with different demographic histories. And intuitively, individuals within the same family are more similar than individuals from different families, etc. But how do we go about **quantifying** the genetic **distance** between two individuals? This question is now more relevant than ever, given the popularity of genetic testing services such as [23andMe](http://23andme.com/) - for example, have you too lamented the arrival of the first "how related are we app" and the first "will we make healthy babies app"? Surely this form of technological eugenics is on the horizon, save for legislative intervention.

### New technology

Beyond the consumer market, geneticists now have the ability to sequence the DNA of a complete human genome for roughly $4000, and recent press releases suggest a genome will cost less than a memorable night on the town by the end of the year (see [CoreGenomics](http://core-genomics.blogspot.com/2012/01/hiseq-2500-whats-in-upgrade-and-first.html) and [GenomeWeb](http://www.genomeweb.com/sequencing/illumina-plans-mid-year-launch-genome-day-hiseq-2500-miseq-upgrade) for more details). As a result, we can now construct nearly complete catalogs of the genetic variation in many individuals for a minuscule fraction of the time and cost of even five years ago.
In fact, in recent scientific meetings, I've heard other estimates that between 30 and 50 thousand human genomes will be sequenced in 2012 alone. The ability to collect exquisite genome-scale data for so many people is a boon for those interested in genetic variation and its bearing on traits and disease susceptibility. Unfortunately, the underlying differences in human demography can lead to differences masquerading as variation that is correlated with a phenotype. Therefore, one of the standard analyses in large-scale genetic studies is to identify "clusters" of individuals (typically through principal components analysis) in the study that are genetically similar to one another. We often have some sense of this ahead of time, as we likely know a fair amount about the individuals in the cohort. However, there are often more subtle similarities (and differences) than we would have expected, and if we don't account for them, our inferences can be misleading or just wrong. The fundamental measure underlying these analyses is a measure of the "Euclidean" genetic distance between the individuals in the study.

### Genetic distance

This brings us back to how to **quantify** the genetic **distance** between two individuals. As you might expect, it turns out this topic has been studied extensively. One of the simplest (and most effective) approaches is to compute a Euclidean distance between two individuals by summarizing the allelic "distance" between the individuals at a large number (typically a million or more) of known bi-allelic sites. For example, imagine a given locus on chromosome 1 (say position 2,373,200) is known to have two different alleles (A and B) in the human population. Let's assume the A allele is the more frequent (major) allele, and B is the less common (minor) allele. Thus some individuals will be homozygous for the A allele, most will be heterozygous for the A and B allele, and a minority will be homozygous for the B allele.
The typical way to encode these three different genotype possibilities is with numbers that reflect the number of minor alleles that an individual has. For example,

* 0 indicates that the individual is homozygous for A
* 1 indicates that the individual is heterozygous for A and B
* 2 indicates that the individual is homozygous for B
* -1 indicates that the genotype is missing/unknown

Thus, as expected, if two individuals have the same genotype (say 2 and 2), the "distance" at that site is zero, and the greatest possible distance is 2 when one individual is AA (0) and the other is BB (2). The genetic distance statistic between two individuals *i* and *j* is, in essence, computed by summing the squared genotype difference over all *M* sites where a genotype *G* is available for both individuals, and then dividing by *M*.

$$d(i,j)=\frac{\sum_{m=1}^{M}\left|G_{i,m} - G_{j,m}\right|^{2}}{M}$$

### The code: computing distances with Python and NumPy

I've been working recently with several whole-genome datasets involving hundreds and even thousands of individuals. In studying these datasets, I have needed to compute the genetic distance between each pair of samples for the reasons outlined above. Given that there are roughly N^2 / 2 pairwise distances (exactly N(N-1)/2) to measure among N individuals, the speed of this calculation can matter quite a bit. To this end, below I compare using vanilla Python loops versus vectorized NumPy operations for calculating d(i,j).

Let's set up the experiment by using NumPy arrays to simulate 10,000,000 genotypes for two individuals, Joe and Jim. Note that the *high* parameter is exclusive, so the randomly generated genotype values will be -1, 0, 1, or 2 (see above for details about what the numbers represent).

    import numpy as np

    joe = np.random.randint(low=-1, high=3, size=10000000)
    jim = np.random.randint(low=-1, high=3, size=10000000)

Now, recall that we only want to compute the distance for sites where a valid (i.e., not -1) genotype is available for both individuals.
To eliminate sites where this isn't the case, we can use NumPy's masked-array function `np.ma.masked_where()` to create a boolean array called both_mask indicating whether each of the 10,000,000 sites is valid (True) or not (False) in both individuals.

    # which of joe's genotypes are valid?
    joe_mask = np.ma.masked_where(joe>=0, joe).mask
    # which of jim's genotypes are valid?
    jim_mask = np.ma.masked_where(jim>=0, jim).mask
    # which of both joe's and jim's genotypes are valid?
    both_mask = joe_mask & jim_mask

Now, we define a function dist_loops() that uses plain Python loops to iterate through each valid site and compute the distance between two individuals, *ind1* and *ind2*.

    def dist_loops(ind1, ind2, both_mask):
        """
        Compute d(i,j) between two sample genotype vectors
        using loops
        """
        numerator = 0    # sum of squared genotype differences
        denominator = 0  # number of sites valid in both samples
        for i in range(len(ind1)):
            if both_mask[i]:
                numerator += abs(ind1[i] - ind2[i]) ** 2
                denominator += 1
        return float(numerator) / float(denominator)

Running this function for joe and jim takes 14.8 seconds on my Intel i7 Macbook Pro. While this may not seem like a long time, were we to use this function to compute pairwise distances for 1000 individuals (1e6 / 2 comparisons), it would take **over 2000 hours (~3 months) on a single processor!** For me, this is motivation for improvement.

    import time

    start = time.time()
    dist_loops(jim, joe, both_mask)
    stop = time.time()
    ptime = stop - start
    print(ptime)

I am fairly new to [NumPy](http://numpy.scipy.org/), but have experience with similar software such as [Matlab](http://www.mathworks.com/products/matlab/) and [R](http://www.r-project.org/). The beauty of these frameworks is that they are designed to compute vector and matrix operations using *vectorized* operations (for more details about vectorized operations, see this [stackoverflow thread](http://stackoverflow.com/questions/1422149/what-is-vectorization) or this [Wikipedia article](http://en.wikipedia.org/wiki/Vectorization_%28parallel_computing%29)).
The result is that for algorithms like this one, there are often tremendous performance gains to be had by operating on an entire array at once instead of looping over each element in the array as we did in the vanilla Python approach. The function below illustrates the vectorized NumPy strategy.

    def dist_numpy(ind1, ind2, both_mask):
        """
        Use vectorized numpy ops to compute
        d(i,j) between two sample genotype vectors
        """
        numerator = float(np.sum(np.square((ind1 - ind2)[both_mask])))
        denominator = float(np.sum(both_mask))
        return numerator / denominator

Running this function for joe and jim takes 0.259 seconds on the same computer. In other words, it is roughly 60 times faster and would reduce the run time for 1000 individuals to about **36 hours on a single processor**. To me, that is a very impressive difference indeed. In summary, I like NumPy.
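Since the motivation was computing distances for every pair of samples, here is a minimal sketch that extends the vectorized idea to a small genotype matrix. It assumes one row per individual, genotypes encoded as above with -1 for missing, and it builds the validity mask with a plain boolean comparison rather than a masked array. The function names `genetic_distance` and `pairwise_distances` are hypothetical.

```python
import numpy as np


def genetic_distance(g1, g2):
    """Mean squared genotype difference over sites that are valid (>= 0) in both samples."""
    valid = (g1 >= 0) & (g2 >= 0)
    n_valid = int(valid.sum())
    if n_valid == 0:
        return float("nan")  # no shared sites, so the distance is undefined
    diff = g1[valid].astype(np.int64) - g2[valid].astype(np.int64)
    return float(np.sum(diff * diff)) / n_valid


def pairwise_distances(genotypes):
    """Fill a symmetric matrix with d(i, j) for every pair of rows."""
    n = genotypes.shape[0]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = genetic_distance(genotypes[i], genotypes[j])
    return dist


# Three individuals, four sites; -1 marks a missing genotype.
toy = np.array([
    [0, 1, 2, -1],
    [2, 1, -1, 0],
    [0, 1, 2, 2],
])
dist = pairwise_distances(toy)
print(dist)
```

The outer loops run once per pair of individuals, while the per-site work stays vectorized, which is where nearly all of the time goes when each row holds millions of genotypes.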

#genetics #code #MDS #distance #python #numpy

BEDTools Version 2.13.0

Owing to travel and time spent trying to get my new lab up and running, it's been a few months since the last BEDTools release (v2.12.0). Today I released version 2.13.0, which includes three new tools, as well as several useful new options and bug fixes. I am grateful to all the people that have helped with this release. I've tried to highlight their efforts below, but I am sure I've missed something. My apologies to those whose efforts I've missed. Please feel free to email me or the bedtools mailing list with suggestions or comments.

New tools

multiBamCov. This new tool counts sequence coverage for multiple position-sorted and indexed BAM files at specific loci defined in a BED/GFF/VCF file. In the example below, the last 3 columns represent the number of alignments overlapping each interval from the three BAM files. Also, multiBamCov works with a single BAM file, and because each interval in the BED/GFF/VCF file is explored and reported in order, this serves as an alternative to coverageBed, which reports the output in a different order than the B file.

    $ multiBamCov -bams aln.1.bam aln.2.bam aln3.bam -bed exons.bed
    chr1 861306 861409 SAMD11 1 + 181 280 236
    chr1 865533 865718 SAMD11 2 + 249 365 374
    chr1 866393 866496 SAMD11 3 + 162 298 322

The following options are available to control which types of alignments are counted. Many thanks to Chip Stewart for the addition of the -D and -F options.

    -q  Minimum mapping quality allowed. Default is 0.
    -D  Include duplicate-marked reads. Default is to count non-duplicates only.
    -F  Include failed-QC reads. Default is to count pass-QC reads only.
    -p  Only count proper pairs. Default is to count all alignments with MAPQ greater than the -q argument, regardless of the BAM FLAG field.

tagBam. This tool annotates a BAM file with custom tag fields based on overlaps with BED/GFF/VCF files. The default tag type is "YB", but with the -tag option, one can specify custom tag types. In the example below, for alignments that have overlaps, you should see new BAM tags like "YB:Z:exonic", "YB:Z:cpg;utr":

$ tagBam -i aln.bam -files exons.bed introns.bed cpg.bed utrs.bed -labels exonic intronic cpg utr > aln.tagged.bam

nucBed. This new tool profiles the nucleotide content of intervals in a fasta file. Thanks to Can Alkan for suggesting a header line. The following information will be reported after each original BED/GFF/VCF entry:

1) %AT content
2) %GC content
3) Number of As observed
4) Number of Cs observed
5) Number of Gs observed
6) Number of Ts observed
7) Number of Ns observed
8) Number of other bases observed
9) The length of the explored sequence/interval.
10) The sequence extracted from the FASTA file. (optional, if -seq is used)
11) The number of times a user defined pattern was observed. (optional, if -pattern is used.)

For example:

    $ nucBed -fi ~/data/genomes/hg18/hg18.fa -bed simrep.bed | head -3
    #1_usercol 2_usercol 3_usercol 4_usercol 5_usercol 6_usercol 7_pct_at 8_pct_gc 9_num_A 10_num_C 11_num_G 12_num_T 13_num_N 14_num_oth 15_seq_len
    chr1 10000 10468 trf 789 + 0.540598 0.459402 155 96 119 98 0 0 468
    chr1 10627 10800 trf 346 + 0.445087 0.554913 54 55 41 23 0 0 173

One can also report the sequence itself:

    $ nucBed -fi ~/data/genomes/hg18/hg18.fa -bed simrep.bed -seq | head -3
    #1_usercol 2_usercol 3_usercol 4_usercol 5_usercol 6_usercol 7_pct_at 8_pct_gc 9_num_A 10_num_C 11_num_G 12_num_T 13_num_N 14_num_oth 15_seq_len 16_seq
    chr1 10000 10468 trf 789 + 0.540598 0.459402 155 96 119 98 0 0 468 ccagggg...
    chr1 10627 10800 trf 346 + 0.445087 0.554913 54 55 41 23 0 0 173 TCTTTCA...

Or, one can count the number of times that a specific pattern occurs in the intervals (reported as the last column):

    $ nucBed -fi ~/data/genomes/hg18/hg18.fa -bed simrep.bed -pattern CGTT | head
    #1_usercol 2_usercol 3_usercol 4_usercol 5_usercol 6_usercol 7_pct_at 8_pct_gc 9_num_A 10_num_C 11_num_G 12_num_T 13_num_N 14_num_oth 15_seq_len 16_user_patt_count
    chr1 10000 10468 trf 789 + 0.540598 0.459402 155 96 119 98 0 0 468 0
    chr1 10627 10800 trf 346 + 0.445087 0.554913 54 55 41 23 0 0 173 0
    chr1 10757 10997 trf 434 + 0.370833 0.629167 49 70 81 40 0 0 240 0
    chr1 11225 11447 trf 273 + 0.463964 0.536036 44 86 33 59 0 0 222 0
    chr1 11271 11448 trf 187 + 0.463277 0.536723 37 69 26 45 0 0 177 0
    chr1 11283 11448 trf 199 + 0.466667 0.533333 37 64 24 40 0 0 165 0
    chr1 19305 19443 trf 242 + 0.282609 0.717391 17 57 42 22 0 0 138 1
    chr1 20828 20863 trf 70 + 0.428571 0.571429 10 7 13 5 0 0 35 0
    chr1 30862 30959 trf 79 + 0.556701 0.443299 35 22 21 19 0 0 97 0
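For readers who want to sanity-check the nucBed columns, the per-interval arithmetic can be sketched in a few lines of Python. This is an illustration, not the nucBed implementation: the function name `nucleotide_profile` is hypothetical, counting is case-insensitive by assumption, and the percentages are assumed to be fractions of the full interval length (which agrees with the rows above, where no Ns are present).

```python
from collections import Counter


def nucleotide_profile(seq):
    """Count bases in a sequence and report AT/GC content as fractions of its length."""
    counts = Counter(seq.upper())  # assumption: soft-masked (lowercase) bases count too
    a, c, g, t, n = (counts[base] for base in "ACGTN")
    length = len(seq)
    other = length - (a + c + g + t + n)
    return {
        "pct_at": (a + t) / length if length else 0.0,
        "pct_gc": (g + c) / length if length else 0.0,
        "num_A": a, "num_C": c, "num_G": g, "num_T": t,
        "num_N": n, "num_oth": other, "seq_len": length,
    }


# Synthetic sequence with the same base counts as the first row above.
synthetic = "A" * 155 + "C" * 96 + "G" * 119 + "T" * 98
profile = nucleotide_profile(synthetic)
print(round(profile["pct_at"], 6), round(profile["pct_gc"], 6), profile["seq_len"])
```

With the base counts from the first example row (155 As, 96 Cs, 119 Gs, 98 Ts), the fractions come out to 0.540598 and 0.459402, matching the reported `7_pct_at` and `8_pct_gc` values.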

New options

Support for "named pipes" and FIFOs. My sincere thanks to Davide Cittaro, Michael Hoffman and Nate Weeks for the help in working this out. This allows things like:

    $ intersectBed -a <(head a.gff) -b <(head b.gff)

    ### OR ###

    $ mkfifo tmp_pipe
    $ awk '$7 == 0' ID137308R292.cov > tmp_pipe &
    $ intersectBed -a <(awk '$7 == 0' ID137308.cov) -b tmp_pipe

All BEDTools now allow the use of "-" to indicate that data is being sent via stdin. In order to allow backwards compatibility, "stdin" is also allowed.

Multiple tools. Added new -S (that is, opposite strands) option to annotateBed, closestBed, coverageBed, intersectBed, pairToBed, subtractBed, and windowBed (-Sm). This new option does the opposite of the -s option: that is, overlaps are only processed if they are on opposite strands. Thanks to Sol Katzman for the great suggestion. Very useful for certain RNA-seq analyses.

coverageBed. Added a new -counts option to coverageBed that only reports the count of overlaps, instead of also computing fractions, etc. This is much faster and uses much less memory.

genomeCoverageBed. Added new -scale option that allows the coverage values to be scaled by a constant. Useful for normalizing coverage with RPM, RPKM, etc. Thanks to Ryan Dale for the useful suggestion. Added new -5, -3, -trackline, -trackopts, and -dz options. Many thanks to Assaf Gordon for these improvements.

    -5: Calculate coverage of 5' ends (instead of entire interval).
    -3: Calculate coverage of 3' ends (instead of entire interval).
    -trackline: Adds a UCSC/Genome-Browser track line definition in the first line of the output.
    -trackopts: Writes additional track line definition parameters in the first line.
    -dz: Report the depth at each genome position with zero-based coordinates, instead of one-based.

closestBed. See below, thanks to Brent Pedersen, Assaf Gordon, Ryan Layer and Dan Webster for the helpful discussions.

closestBed now reports _all_ features in B that overlap A by default. This allows folks to decide which is the "best" overlapping feature on their own.

closestBed now has a "-io" option that ignores overlapping features. In other words, it will only report the closest, non-overlapping feature.

An example:

    $ cat a.bed
    chr1 10 20

    $ cat b.bed
    chr1 15 16
    chr1 16 40
    chr1 100 1000
    chr1 200 1000

    $ bin/closestBed -a a.bed -b b.bed
    chr1 10 20 chr1 15 16
    chr1 10 20 chr1 16 40

    $ closestBed -a a.bed -b b.bed -io
    chr1 10 20 chr1 100 1000
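The behavior in this example can be mimicked with a few lines of Python. This is a rough sketch of the reporting rule only, not how closestBed is implemented: it assumes half-open BED coordinates on a single chromosome, treats any overlap as distance 0, and uses a hypothetical `ignore_overlaps` flag to stand in for `-io`.

```python
def overlaps(a, b):
    """True if two half-open (start, end) intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def distance(a, b):
    """0 for overlapping intervals; otherwise the gap between them plus one (a sketch convention)."""
    if overlaps(a, b):
        return 0
    return max(b[0] - a[1], a[0] - b[1]) + 1


def closest_features(a, b_features, ignore_overlaps=False):
    """Return every feature in b_features tied for the smallest distance to a."""
    candidates = [b for b in b_features if not (ignore_overlaps and overlaps(a, b))]
    if not candidates:
        return []
    best = min(distance(a, b) for b in candidates)
    return [b for b in candidates if distance(a, b) == best]


a = (10, 20)
b_features = [(15, 16), (16, 40), (100, 1000), (200, 1000)]

default_hits = closest_features(a, b_features)
io_hits = closest_features(a, b_features, ignore_overlaps=True)
print(default_hits)
print(io_hits)
```

Because every overlapping feature ties at distance 0, the default call returns both overlapping features, while excluding overlaps leaves the interval starting at 100 as the nearest one, mirroring the two outputs shown in the example.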

Updates

Updated to the latest version of BamTools. This allows greater functionality and will facilitate new options and tools in the future.

Bug Fixes

GFF files cannot have zero-length features.

Corrected an erroneous check on the start coordinates in VCF files. Thanks to Jan Vogel for the correction.

mergeBed now always reports output in BED format.

Updated the text file Tokenizer() function to yield a 15% speed improvement.

Various tweaks and improvements.

#bedtools #v2.13.0

Ambition

Student: I'm looking for some advice on genome simulation.

Me: Well, I might be able to help. What exactly are you trying to simulate?

Student: EVERYTHING!
