Parsing ISBD. part II, Contextualizing MARC Data
When I resumed blogging last year, I had aimed to post at least a couple of times per month. It was an ambitious goal and I did not succeed; hopefully this year will be better?
Anyway, please enjoy the following long overdue conclusion to our ISBD parsing discussion, rescued from my drafts. I wrote this many months ago, and have since been working on deepening my understanding of serials cataloguing, but I'm going to publish this post as is so that we can move on to other MARC topics next time!
Though the parser combinators we talked about in the last post are powerful, we sometimes need more context when making ISBD parsing decisions. Consider the following two examples:
Abstracts of Bulgarian scientific literature. Mathematics, physics, astronomy, geophysics, geodesy / Bulgarian Academy of Sciences, Centre for Scientific Information and Documentation.
MInd, the meetings index. Series SEMT, Science, engineering, medicine, technology.
The first is for one set of volumes (titled Mathematics, physics, astronomy, geophysics, geodesy) of a multipart monograph (Abstracts of Bulgarian scientific literature); the second is for a series titled Science, engineering, medicine, technology, designated by the series name Series SEMT, within the journal MInd, the meetings index.
The data up to the first period in both cases denotes the "common title" of each work, but it's what follows that's interesting. There are two possible ISBD patterns that can be applied here, based on the grammar alone:
Common title. Dependent title designation, Dependent title
Common title. Dependent title
As you can see, the commas in the dependent title of both examples make it ambiguous as to which way they should be parsed. Technically, there's no reason why "Mathematics" in the first title couldn't be parsed as a dependent title designation, even though we can tell from our understanding of English that that isn't correct. Our parser doesn't understand natural language, though; it needs some simpler way to decide which rule to apply.
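To make the ambiguity concrete, here is a minimal sketch in Python (used purely for illustration; it is not the parser from the previous post). The two regular expressions are my own rough stand-ins for the two patterns above, and `grammatical_readings` is a hypothetical helper name. The statement of responsibility is left off the first title to keep the patterns short.

```python
import re

# Rough stand-ins for the two ISBD patterns (assumed, simplified):
#   Common title. Dependent title designation, Dependent title
#   Common title. Dependent title
WITH_DESIGNATION = re.compile(
    r"^(?P<common>[^.]+)\. (?P<designation>[^,]+), (?P<dependent>.+)$"
)
WITHOUT_DESIGNATION = re.compile(
    r"^(?P<common>[^.]+)\. (?P<dependent>.+)$"
)


def grammatical_readings(title):
    """Return every pattern that matches the title on grammar alone."""
    readings = []
    for name, pattern in (
        ("with designation", WITH_DESIGNATION),
        ("without designation", WITHOUT_DESIGNATION),
    ):
        match = pattern.match(title)
        if match:
            readings.append((name, match.groupdict()))
    return readings


abstracts = (
    "Abstracts of Bulgarian scientific literature. "
    "Mathematics, physics, astronomy, geophysics, geodesy"
)
mind = (
    "MInd, the meetings index. "
    "Series SEMT, Science, engineering, medicine, technology."
)

# Both titles satisfy both patterns, so grammar alone cannot choose.
for title in (abstracts, mind):
    for name, parts in grammatical_readings(title):
        print(name, parts)
```

Under the first pattern, "Mathematics" comes back as a designation just as readily as "Series SEMT" does, which is exactly the problem.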
There are two different ways to parse MARC title data, and as this example shows, neither on its own is the right way. A MARC 245 field can be parsed according to its subfields, or according to its ISBD grammar as we've been doing, but these parses are non-composable (the elements extracted by each parse do not always line up with each other), and kind of orthogonal to each other (each may capture something that the other misses).
As I've said before, context-sensitive parsing allows us to feed extra information to the parser to help "contextualize" its parsing decisions. In this case, we want to allow our ISBD parser to access and work with the subfields-based parse when parsing the whole title statement. (Note that this is very different from breaking a field into its subfields and then trying to analyze the grammar within each subfield; instead we're keeping the data intact and looking at the ISBD structure in parallel to the subfields structure.)
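One way to picture "in parallel" is to keep the field as a single string while remembering which character range each subfield covers. The following Python sketch is only an illustration under assumptions: `SubfieldSpan` and `flatten_field` are hypothetical names, the subfield coding shown for the MInd title is my guess at how such a record might be coded, and joining values with a single space is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubfieldSpan:
    code: str   # subfield code, e.g. "a", "n", "p"
    start: int  # offset of the value's first character in the intact text
    end: int    # offset just past the value's last character


def flatten_field(subfields):
    """Join subfield values into one string and record each value's span."""
    text = ""
    spans = []
    for code, value in subfields:
        if text:
            text += " "  # assumed: values are separated by one space
        start = len(text)
        text += value
        spans.append(SubfieldSpan(code, start, len(text)))
    return text, spans


# Assumed subfield coding for the second example title.
text, spans = flatten_field([
    ("a", "MInd, the meetings index."),
    ("n", "Series SEMT,"),
    ("p", "Science, engineering, medicine, technology."),
])

# The ISBD parser can work on `text` as a whole, and at any offset it can
# ask which subfield that offset falls in.
for span in spans:
    print(span.code, repr(text[span.start:span.end]))
```

The point of the design is that nothing is cut apart: the ISBD grammar sees the whole title statement, and the subfield boundaries are available as extra context rather than as a pre-split input.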
Going back to the dependent title designation (DTD) problem, we could use this idea to define a subparser as follows:
Look for a candidate DTD in the form of a string followed by a comma.
Consult the subfields parse to see whether any subfield n value at or after our current parse position matches the candidate DTD.
If there are, update the parse position in both the ISBD and the subfields parse, and return a successful match. If there are no values in the MARC data that match our candidate, then return a failure to match, which will allow the parser to backtrack and try a different pattern.
With this logic, when the parser attempts to match "Mathematics" as a DTD, it will fail to find |nMathematics in the MARC data (because that string is part of subfield p), and will instead correctly use the second pattern above.
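The three steps can be sketched roughly as follows. This is a simplified Python illustration, not the actual implementation: the `State` tuple, the `strip_punct` helper, and the subfield coding of both example records are assumptions, and a real parser combinator library would thread state and backtracking differently.

```python
from typing import NamedTuple


class State(NamedTuple):
    text: str         # the intact title statement
    pos: int          # current position in the ISBD parse
    subfields: tuple  # (code, value) pairs from the subfields parse
    sf_index: int     # index of the next unconsumed subfield


def strip_punct(value):
    """Drop trailing ISBD punctuation so values compare cleanly."""
    return value.rstrip(" ,.:;/=")


def dependent_title_designation(state):
    """Return (designation, new_state) on success, or None on failure."""
    # Step 1: candidate DTD is the text up to the next comma.
    comma = state.text.find(",", state.pos)
    if comma == -1:
        return None
    candidate = state.text[state.pos:comma].strip()
    # Step 2: look for a matching subfield n at or after our position.
    for i in range(state.sf_index, len(state.subfields)):
        code, value = state.subfields[i]
        if code == "n" and strip_punct(value) == candidate:
            # Step 3: advance both parses together and succeed.
            return candidate, state._replace(pos=comma + 1, sf_index=i + 1)
    # No match: fail, so the caller can backtrack to the other pattern.
    return None


# Assumed subfield coding for the two example titles.
mind = State(
    text="MInd, the meetings index. Series SEMT, "
         "Science, engineering, medicine, technology.",
    pos=len("MInd, the meetings index. "),
    subfields=(
        ("a", "MInd, the meetings index."),
        ("n", "Series SEMT,"),
        ("p", "Science, engineering, medicine, technology."),
    ),
    sf_index=1,
)
abstracts = State(
    text="Abstracts of Bulgarian scientific literature. Mathematics, "
         "physics, astronomy, geophysics, geodesy / Bulgarian Academy of "
         "Sciences, Centre for Scientific Information and Documentation.",
    pos=len("Abstracts of Bulgarian scientific literature. "),
    subfields=(
        ("a", "Abstracts of Bulgarian scientific literature."),
        ("p", "Mathematics, physics, astronomy, geophysics, geodesy /"),
        ("c", "Bulgarian Academy of Sciences, Centre for Scientific "
              "Information and Documentation."),
    ),
    sf_index=1,
)

mind_result = dependent_title_designation(mind)
abstracts_result = dependent_title_designation(abstracts)
```

With these inputs, `mind_result` carries the designation "Series SEMT" and a state advanced past both the comma and subfield n, while `abstracts_result` is `None`, leaving the parser free to try the plain "Common title. Dependent title" pattern.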