What is in a (chromosome) name?
Have you ever been incensed by the ridiculous number of chromosome naming and ordering schemes that exist in genomics? If the answer is "no", then either you are an incredibly patient person, you enjoy unnecessary chaos, or you just haven't done any detailed analysis of genomics datasets. One of the main research areas in my lab is the development of well-tested, easy-to-use software for genomic research (I'll be the first to admit that we have room to improve...). The pursuit of the "easy to use" goal is complicated by many factors such as the size and complexity of the datasets we deal with. Bad algorithms that happily succeed with smallish datasets quickly fail miserably when we encounter, to quote Titus Brown, "datasets of abnormal size". But perhaps the most annoying complexity comes from the lack of a standard for naming and ordering chromosomes. For example, the largest human chromosome is often named "chr1", "1", "Chr1". Further still, as Deanna Church pointed out to me on Twitter, we could use a GenBank ID "CM000663.1" or the RefSeq ID "NC_000001.10" [see NCBI for details](http://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.13/).
@aaronquinlan better yet, take advantage of the robust data management you get from GenBank and use accession.versions: ncbi.nlm.nih.gov/assembly/GCF_0…
January 29, 2013
Myriad naming systems pose a more vexing problem for algorithm development. Specifically, one way for genomics tools to keep pace with data scale is to employ algorithms that exploit *pre-sorted* data. When data is pre-sorted, one can often avoid loading full datasets into memory and instead, "sweep" through the data and do one's work on "the fly" (e.g., the "chromsweep" algorithm invoked by the `-sorted` option in bedtools). However, the crucial issue is that most cases, we have to test if one chromosome is before or after another; thus, we must know what the sorting order is. Because of the diverse naming schemes above, most tool developers store chromosome names internally as strings. That is, whether it is "chr1" or truly the number 1, it is stored as "chr1" or "1" because we cannot predict ahead of time what naming scheme you have chosen. Annoyed yet? In almost every programming language that I know of, storing chromosome names as strings forces code, by default using operators such as `==`, `<`, and `>`, to compare chromosomes based on an assumed [lexicographical ordering](http://en.wikipedia.org/wiki/Lexicographical_order). That is, the following will be true (expected): return "chr1" < "chr2" yet, so will this: return "chr10" < "chr2" For example, here is a lexicographical ordering of the human autosomes, sex chromosomes, and the mitochondria: chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr2 chr20 chr21 chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrM chrX chrY Yet we often want to store our data in a more sensible manner such as: chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY chrM The issue is that because these are strings, the above order violates lexicographical ordering. Now, in this specific case, there are ways around the problem. We can write our own custom sort comparators that ignore leading alphabetical characters and instead compare chromosome names based on the business end of the label - that is, the part that smells like a number. In fact, that is what Heng Li has done in this function in samtools: static inline int strnum_cmp(const char *a, const char *b) { char *pa, *pb; pa = (char*)a; pb = (char*)b; while (*pa && *pb) { if (isdigit(*pa) && isdigit(*pb)) { long ai, bi; ai = strtol(pa, &pa, 10); bi = strtol(pb, &pb, 10); if (ai != bi) return aibi? 1 : 0; } else { if (*pa != *pb) break; ++pa; ++pb; } } if (*pa == *pb) return (pa-a) < (pb-b)? -1 : (pa-a) > (pb-b)? 1 : 0; return *pa<*pb? -1 : *pa>*pb? 1 : 0; } The problem I have with this approach is that it makes assumptions about the naming scheme. For the tools my lab develops and maintains, I am interested in **general** approaches that work no matter the naming scheme and the data ordering. I was particularly motivated by a recent [issue](http://code.google.com/p/bedtools/issues/detail?id=146) that was posted by a user of our [bedtools software](http://code.google.com/p/bedtools/). Basically, the user was trying to intersect genomic intervals from two large files stored in BED format using the `-sorted` option, which invokes a memory-efficient algorithm that can detect intersecting intervals from two files, provides that they are (you guessed it) *lexicographically* sorted. Well, the OP's data were not. In fact they weren't sorted at all: they were "grouped" - that is, all the records from each chromosome were group together, yet the order of which chromosome's data preceded the other followed no specific sorting criteria. Consequently, the user had to maintain multiple versions of the same data; the original version, the version sorted according to the rules used by bedtools, the rules for tool _X_, tool _Y_, and so forth. Maintaining multiple versions of the same data is wasteful. Moreover, sorting takes time and ideally, we'd like to take the [Ronco](http://en.wikipedia.org/wiki/Ron_Popeil) approach --- that is, **sort it and forget it** --- since the cost of sorting once is amortized over the number of times you get to exploit the benefits of pre-sorted data in your analyses. As a scientist, the lack of a clear standard pisses me off. As a developer, it inspires me to think of clever solutions to make my life, and that of others, easier. In an effort to provide a general approach to this issue, we have a nice prototype of a general algorithm for bedtools that will properly find intersections among multiple files no matter what sorting/grouping criteria one uses. There's just one catch: you can't mix criteria among the files. That would just be annoying. In a subsequent post, I will describe the approach and provide examples. // arq
















