GSoC certificate

seen from Germany
seen from Netherlands
seen from Canada
seen from Ireland
seen from Russia

seen from United States

seen from India
seen from Germany

seen from Kazakhstan
seen from South Korea

seen from Malaysia

seen from Malaysia
seen from Australia
seen from Japan

seen from New Zealand

seen from Mexico

seen from Japan
seen from Bangladesh
seen from United States
seen from China
GSoC certificate

Anya is live and ready to show you everything. Watch her strip, dance, and perform exclusive shows just for you. Interact in real-time and make your fantasies come true.
Free to watch • No registration required • HD streaming
GSoC t-shirt
And the summer ends
The coordinate mapper, with updated documentation, is now located on [this branch](https://github.com/lennax/biopython/tree/f_loc4), awaiting the merging of Peter's f_loc4 branch. I've written an [entry](http://biopython.org/wiki/Coordinate_mapping) on coordinate mapping for the Cookbook. Additionally, at Peter's suggestion, I've written a clarification of strand as it relates to transcription and translation. It's available [here](https://docs.google.com/document/d/11R7EOJXn90lN5_SmaPOyN5rFfPQybbCbUBo6EY0R0pA/edit). It's been a great experience working with this project this summer. Thank you to everyone involved.
Strand
The summer is winding to a close. I've spent this week busy with orientation events and meetings for my upcoming PhD program. I hope to have time to continue to contribute to Biopython in my spare time, and ideally I would like to use and expand Biopython as a portion of my research. I have been considering how to handle gene strandedness. As long as I'm correctly interpreting the following position, my coordinate mapper should produce the correct coordinates with negative strand or mixed strand features. GenBank: join(complement(25..30), 36..40) Biopython: FeatureLocation(24, 30, -1) + FeatureLocation(35, 40) 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 <---------------- -------------> 5 4 3 2 1 0 6 7 8 9 10 I have to admit that it wasn't until I read a BioStar [post](http://biostars.org/post/show/3423/forward-and-reverse-strand-conventions/) earlier this week that I fully understood the relationship between plus/minus forward/reverse sense/antisense coding/template strands. So please let me know as soon as possible if I've made a mistake in the above code. `c2g` yields the correct genome position, but not the strand. I still need to integrate strand information into my `GenomePosition` object and/or partially merge it with `ExactLocation`. This weekend I intend to expand documentation and write a brief cookbook entry.
Coordinate Mapping update
Following extensive [discussion](http://biopython.org/pipermail/biopython-dev/2012-August/009849.html) on the dev list of the pros and cons of configuration classes/modules, I have refactored my [coordinate mapper](https://gist.github.com/3172753) to keep configuration as isolated as possible. All mapping functions use base 0 internally. Transformation to and from 1-based coords is allowed by custom MapPosition objects. (they are currently separate from the Seq* positions but could probably subclass ExactPosition). The MapPosition objects have to_dialect and from_dialect methods that automatically handle conversion between bases and other formatting details. There are two different ways a user can convert a coordinate from HGVS: # ... assuming cm is an instance of CoordinateMapper # Manually construct position from HGVS CDS_coord = CDSPosition.from_hgvs("6+1") genomic_coord = cm.c2g(CDS_coord) print genomic_coord.to_hgvs() # Pass dialect argument to mapping function genomic_coord = cm.c2g("6+1", dialect="HGVS") print genomic_coord.to_hgvs() Furthermore, the inheritance hierarchy is designed to allow a user to set a default string representation: # Set MapPositions to print as HGVS by default def use_hgvs(self): return str(self.to_hgvs()) MapPosition.__str__ = use_hgvs The [revision](https://gist.github.com/3172753/577b7c383e057b78cdcee64be33f18117a46faaf) as of this writing is passing tests using base 0. I have not yet implemented tests for `from_hgvs` or `to_hgvs`, but that's next on my list. I'm hoping to have time for strand and mixed strand, too. Update: The latest [revision](https://gist.github.com/3172753/7c5f285634b124b2ba2f65fd96114c441382a12d) now tests with default settings and HGVS settings.

Anya is live and ready to show you everything. Watch her strip, dance, and perform exclusive shows just for you. Interact in real-time and make your fantasies come true.
Free to watch • No registration required • HD streaming
Coordinate Mapping
I have been expanding the [coordinate mapper](http://lists.open-bio.org/pipermail/biopython/2010-June/006598.html) Reece posted to the dev list a couple of years ago. It's currently living as a [gist](https://gist.github.com/3172753), although it has grown rather precipitously (over 300 lines each of code and testing). I may have gotten slightly carried away with the concept of "test-driven development," but in this case, extensive testing is extremely critical. Note that as of this writing, the code is in disarray; it was much less messy [before](https://gist.github.com/3172753/f12878bb9d34c524f7427fe7d0bde5747e7eb6d1) I started testing it with 1-based output. (More on that below.) I have modified it to work with the new `CompoundLocation` Peter is working on, while retaining the ability to provide exons as a sequence of pairs. ### Representing intron locations ### One of the more complicated operations in coordinate mapping is converting a genomic intron position to CDS coords. I am using the HGVS [conventions](http://www.hgvs.org/mutnomen/refseq_figure.html), in which positions are shown in the format 123+x and 124-y, where 123 is the last base of one exon in CDS coords and 124 is the first base of the next exon. I'm using a custom object called `NonExonPosition`. Are there standards that use an alternative display for intron positions relative to CDS coords? ### Configuration ### I'm currently using a configuration class to store customization for display of positions, most importantly conversion between the 0-based, Pythonic internal representation and 1-based representation used by many formats (GenBank and HGVS, to name just a few). However, I have read a variety of StackOverflow threads that imply that if I'm trying to use globals or a singleton class, I'm doing something wrong. Is it unwise to use a class simply as a storage object? Does anyone have a "more Pythonic" way to handle module-wide configuration? On a related note, I have been fiddling with adding and subtracting the pesky 1s all day, and I am thinking that I need to adjust my design and refactor the code slightly to maintain internal 0-based coordinates while adjusting the string representation to match 0- or 1-based coords. I'll likely use a class similar to `ExactPosition` (i.e. subclass `int`). Actually, would it be useful to add to the *Position objects a method for representation in GenBank and other formats? ### Other considerations ### * I have a few questions about negative integers. Can genomic or protein coordinates ever have negative coordinates? For CDS, as it be confusing for the integer -1 and the string '-1' to be treated differently, my code interprets both as a downstream position. Using the pattern of the Python list index doesn't make biological sense. * I haven't yet tackled circular genomes. Are they handled by `SeqFeature`/`FeatureLocation` yet? In GFF, the end is represented as an index greater than the "length" of the entire sequence. In order to handle this, it should only be a matter of `except`ing an `IndexError`. * I also haven't considered how strand will influence coordinate mapping. Any pointers in this direction would be appreciated. * Are there Biopython objects other than `FeatureLocation` that should be transformable? For example, is it worth attempting to map non-exact locations across coordinates? ------- All that said, I hope to have a production-ready coordinate mapper by the end of the week. Mailing list: * http://lists.open-bio.org/pipermail/biopython-dev/2012-August/009847.html * http://lists.open-bio.org/pipermail/biopython-dev/2012-August/009849.html
Week 8.5
I previously proposed the implementation of a method for PyVCF that would quickly scan the entire file and provide useful summary statistics. The idea is shamelessly copied from [Brad's GFF parser](https://github.com/chapmanb/bcbb/tree/master/gff "github"); for GFF, this method is helpful because the annotations on a sequence can vary widely. However, I no longer think this would be useful for VCF: 1. Most importantly, the VCF headers generally contain a complete listing of all of the types of information contained in the file. It's technically optional, but I hope that the most commonly used variant callers produce accurate headers. However, if there is a prevalence of files with a mismatch between headers and actual INFO/FORMAT fields, please let me know. 2. Next, any listing of ranges of data such as POS or QUAL might as well be coupled with actual filtering. This would be different if a presentation of the distribution of quality scores would be necessary to set an appropriate threshold. It would also depend on the ratio of speed between the range scan and the filtering (i.e. whether a possible second filter would be unacceptably time consuming). 3. Finally, and perhaps most importantly, many files are so large that scanning an entire file would take too long. Setting a limit and displaying updated information in real time (i.e. writing to `sys.stdout` with '\r', [example](https://gist.github.com/3161269 "github gist")) could overcome this issue. If anyone can think of a great reason to scan a VCF file before filtering it, please get in touch. ------- I added the method `as_SeqFeature()` to my basic variant class, but it's still incomplete. Some of this is in flux due to forthcoming changes to FeatureLocation. I'm currently working on expanding the [coordinate mapper](http://biopython.org/pipermail/biopython/2010-June/006598.html) Reece posted to the dev list a couple years ago. Expect an update on that very soon.
Week 7: Variations
The highlight of this week was getting strep throat. I also gave myself a quick lesson in setuptools (working on an unrelated project), which is helpful for understanding how various projects do their building and testing. The fragmentation of standards is certainly not one of the things I love about Python (e.g. getopt/optparse/argparse; distutils/setuptools/distribute/distutils2). Reece suggested trying to represent a variety of variants with just five identifiers: accession, start, stop, pre_seq, and post_seq. I've started a very minimal Variant object ([source](https://github.com/lennax/biopython/blob/variant2/Bio/Variant/variant.py)), using `FeatureLocation` for its location. This uses zero-based, right-open coordinates, similar to array counting in Python. In contrast, HGVS and VCF both count from 1. I've created a list of variant types each represented in HGVS, VCF (if possible), and my new Python representation. Please let me know if there are any errors in my interpretation of these variant types. ### single SNP: g.303G>C #CHROM POS ID REF ALT [...] 19 303 . G C [...] (DNAaccession, 302, 303, G, C) ### compound het: g.[303G>A];g.[303G>C] #CHROM POS ID REF ALT [...] FORMAT SAMPNAME 19 303 . G A,C [...] GT 1|2 [(DNAaccession, 302, 303, G, A), (DNAaccession, 303, 304, G, C)] phases = [True] ### RNA variant: r.78u>a # can VCF represent RNA? (RNAaccession, 77, 78, u, a) ### variant in CDS coords (coding region): c.3G>C # can VCF represent CDS? (CDSaccession, 2, 3, G, C) # Given the FeatureLocations of the CDS, convert to genomic coords ### variant in protein: p.Trp26Cys # can VCF represent CDS? (proteinaccession, 25, 26, Trp, Cys) ### trinucleotide repeat: g.77_79dup #CHROM POS ID REF ALT 19 79 . G GCTG # OR 19 77 . CTG CTGCTG (DNAaccession, 79, 79, "", "CTG") # OR (DNAaccession, 76, 79, CTG, CTGCTG)