how to catch a compound lying about where it came from
a methodology post for @kaiasky, who asked how you look at a database entry and decide it doesn’t belong
(i post programmatically, so the two images above appear before the text rather than inline. the first shows what a “natural” molecule’s architecture looks like, the second shows “synthetic.” they’re meant to accompany section i.)
a few days ago i posted about finding four synthetic imposters hiding in NPASS, a database of natural products. @kaiasky asked a wonderful question: how do you actually decide a compound is mislabeled?
the short answer is that natural products have an accent.
i. the accent
every molecule carries the fingerprints of whatever built it. enzymes build molecules the way a particular carpenter builds furniture — there are shapes they reach for, joints they prefer, materials they keep in stock.
biological enzymes work with a limited pantry. they like oxygen and nitrogen. they build rings by fusing them together in flowing, asymmetric patterns. they attach sugars. they leave hydroxyl groups (-OH) everywhere, like a trail of breadcrumbs. the resulting molecules tend to look organic in the oldest sense of the word: curved, irregular, decorated.
pharmaceutical chemists work from a different pantry. they reach for halogens (chlorine, fluorine, bromine) because these are excellent for tuning how a drug binds to its target but are rarely used by biology. they build symmetric scaffolds because symmetry is easier to synthesize at scale. they use linker groups like hydrazones or sulfonamides that connect modular pieces together like lego bricks.
when a synthetic compound wanders into a natural products database, it’s like hearing someone speak with a completely different accent at a regional dialect convention. it doesn’t prove they’re from somewhere else (accents can surprise you) but it tells you to ask follow-up questions.
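a first pass at "listening for the accent" can be as crude as counting which elements a molecule uses. here is a minimal sketch, assuming the structure is available as a SMILES string. the function names are mine, not anything from NPASS or COCONUT, and a real pipeline would use a proper cheminformatics toolkit rather than a regex. it only counts atoms, so it says nothing about symmetry or linkers.

```python
import re
from collections import Counter

# one atom token in a SMILES string: a bracket atom like [N+] or [Fe],
# a two-letter organic-subset atom (Cl, Br), or a one-letter atom
# (lowercase letters are aromatic atoms)
ATOM_TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")
# element symbol inside a bracket atom, skipping an optional isotope number
BRACKET_ELEMENT = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")

HALOGENS = ("F", "Cl", "Br", "I")


def element_counts(smiles: str) -> Counter:
    """Count heavy atoms by element. Implicit hydrogens are not counted."""
    counts = Counter()
    for token in ATOM_TOKEN.findall(smiles):
        if token.startswith("["):
            match = BRACKET_ELEMENT.match(token)
            if match is None:
                continue  # unrecognised bracket content: skip rather than guess
            symbol = match.group(1)
        else:
            symbol = token
        counts[symbol.capitalize()] += 1  # aromatic "c" is tallied as "C"
    return counts


def halogen_count(smiles: str) -> int:
    """Number of halogen atoms: worth a follow-up question, not a verdict."""
    counts = element_counts(smiles)
    return sum(counts[h] for h in HALOGENS)


# glucose (all C and O) versus 4-chloronitrobenzene
print(halogen_count("OCC1OC(O)C(O)C(O)C1O"))
print(halogen_count("O=[N+]([O-])c1ccc(Cl)cc1"))
```

a nonzero halogen count is only the "different accent" signal. halogenated natural products do exist, so this flags a compound for the checkpoints rather than deciding anything.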
ii. the interrogation
once a compound sounds suspicious, you run it through a series of checkpoints. each one is a different kind of question, and a compound needs to fail multiple checks before you call it an imposter. one red flag is a curiosity. three red flags is a case.
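the counting rule can be written down as a tiny tally. this is a sketch of the bookkeeping only: the class name, the checkpoint labels, and the wording of the verdicts are my own, and the text doesn't say what two flags mean, so "keep digging" is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Dossier:
    """Running record of which checkpoints a compound has failed."""
    compound_id: str
    red_flags: list[str] = field(default_factory=list)

    def fail(self, checkpoint: str) -> None:
        # record each failed checkpoint once, however many times it is reported
        if checkpoint not in self.red_flags:
            self.red_flags.append(checkpoint)

    def verdict(self) -> str:
        n = len(self.red_flags)
        if n == 0:
            return "no concerns"
        if n == 1:
            return "curiosity"
        if n >= 3:
            return "case"
        return "keep digging"  # assumption: two flags is neither dismissed nor called


d = Dossier("hypothetical-entry-001")
d.fail("citation")
d.fail("other databases")
d.fail("manufacturing fingerprints")
print(d.verdict())  # case
```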
checkpoint 1: trace the citation. every NPASS/COCONUT entry can be traced back to a paper. you go read that paper. does it describe someone actually isolating this compound from a living organism? or does it describe someone synthesizing it in a lab and testing it against a biological target? these are very different things, and the database sometimes treats them as the same.
checkpoint 2: search other databases. you take the compound’s structure and look it up in PubChem (a massive public chemistry database) and ChEMBL (a database of bioactive molecules, mostly from drug discovery). if it shows up as a pharmaceutical intermediate, a screening library hit, or a known drug fragment, that’s informative. a genuine natural product usually has a history rooted in isolation studies. a synthetic compound has a history rooted in medicinal chemistry campaigns.
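for the PubChem half of this checkpoint, the lookup can be scripted. a minimal sketch, assuming you already have the compound's InChIKey and are using PubChem's PUG REST interface. the helper names are mine, and returning an empty list on a 404 is my choice. it only tells you whether PubChem knows the structure. reading the record's history is still a manual step, and ChEMBL would need its own lookup.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pubchem_cid_url(inchikey: str) -> str:
    """Build the PUG REST URL that maps an InChIKey to PubChem compound IDs."""
    return f"{PUG_REST}/compound/inchikey/{urllib.parse.quote(inchikey)}/cids/JSON"


def pubchem_cids(inchikey: str, timeout: float = 10.0) -> list[int]:
    """Return matching PubChem CIDs, or an empty list if PubChem has no match."""
    try:
        with urllib.request.urlopen(pubchem_cid_url(inchikey), timeout=timeout) as resp:
            payload = json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:  # PubChem answers "not found" with a 404
            return []
        raise
    return payload.get("IdentifierList", {}).get("CID", [])


# aspirin's InChIKey, used here only to show the URL shape (no network call)
print(pubchem_cid_url("BSYNRYMUTXBXSQ-UHFFFAOYSA-N"))
```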
checkpoint 3: ask the biosynthesis question. this is the deepest check. you look at the compound’s structure and ask: could biology actually build this? enzymes follow rules. there are well-characterized families of enzymes that build specific types of molecular scaffolds: terpene synthases build terpenes, polyketide synthases build polyketides, nonribosomal peptide synthetases build unusual peptides. if a compound’s skeleton doesn’t fit into any known biosynthetic logic, that’s a strong signal that it came out of a flask rather than an organism.
checkpoint 4: look for manufacturing fingerprints. some compounds carry telltale signs of their synthetic origin:
protecting groups still attached: these are chemical “caps” that chemists use to shield reactive parts of a molecule during synthesis, then remove at the end. if a protecting group is still on, the compound is likely a synthetic intermediate, not a finished natural product.
impossible functional group combinations: certain combinations (like a nitro group next to a hydrazone next to a halogen) don’t arise from any known biological pathway. they’re signatures of pharmaceutical design.
peracetylation: when every hydroxyl group on a molecule has been capped with an acetyl group, it usually means someone was doing protective chemistry in a flask, not isolating a product from a plant.
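the three fingerprints above can be turned into a small checklist. this sketch assumes the functional-group counts have already been computed elsewhere (for example by substructure matching in a cheminformatics toolkit). the field names and the `min_acetyls` threshold are hypothetical, and the nitro/hydrazone/halogen test checks only that all three occur in the same molecule, not that they sit next to each other.

```python
from dataclasses import dataclass


@dataclass
class GroupCounts:
    """Precomputed functional-group counts for one compound (hypothetical schema)."""
    free_hydroxyl: int = 0
    acetyl_ester: int = 0
    nitro: int = 0
    hydrazone: int = 0
    halogen: int = 0
    protecting_groups: tuple[str, ...] = ()


def manufacturing_fingerprints(g: GroupCounts, min_acetyls: int = 3) -> list[str]:
    """List which of the three synthetic fingerprints a compound shows."""
    found = []
    if g.protecting_groups:
        found.append("protecting group still attached: " + ", ".join(g.protecting_groups))
    # co-occurrence only; adjacency would need the actual structure
    if g.nitro and g.hydrazone and g.halogen:
        found.append("nitro + hydrazone + halogen in one molecule")
    # no free hydroxyls left and several acetyl esters present
    if g.free_hydroxyl == 0 and g.acetyl_ester >= min_acetyls:
        found.append("peracetylation")
    return found


print(manufacturing_fingerprints(GroupCounts(nitro=1, hydrazone=1, halogen=2)))
print(manufacturing_fingerprints(GroupCounts(free_hydroxyl=0, acetyl_ester=8)))
```

a single naturally acetylated hydroxyl shouldn't trip the peracetylation check, which is why the threshold asks for several acetyl esters and no free hydroxyls at all.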
iii. four imposters
out of 34 compounds i investigated closely (a representative sample of the 325 initially flagged), four turned out to be synthetic:
the chloro-nitro-hydrazone. three red flags in a single molecule. chlorine atoms are rare in natural products. nitro groups are rare in natural products. hydrazone linkages are rare in natural products. all three together? that’s not an accent anymore, that’s a completely different language. the citation traced back to a medicinal chemistry paper, not an isolation study.
the cefdinir intermediate. cefdinir is a pharmaceutical antibiotic: a cephalosporin. the compound in NPASS/COCONUT was a partially-built version of it, an intermediate from the manufacturing process. it ended up in the database because someone studied its biological activity, and the database ingested the structure without distinguishing “was isolated from nature” from “was tested against a biological target.”
the peracetylated soyasaponin. soyasaponins are genuine natural products — they’re found in soybeans. but the version in the database had every hydroxyl group capped with acetyl groups. that’s not how the molecule exists in nature; that’s how it exists in a chemist’s flask after protective derivatization. the database recorded the modified version as if it were the natural one.
the sulfur heterocycle. a compact ring system built around sulfur, with structural features characteristic of synthetic pharmaceutical scaffolds rather than biological assembly. no known natural biosynthetic pathway produces this particular ring system.
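a note on what 4 out of 34 does and doesn't say about the other flagged compounds. the sketch below treats the 34 as if they were a simple random sample of the 325. that is my assumption ("representative" isn't necessarily "random"), and the Wilson interval is my choice of method, not part of the original analysis.

```python
import math


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives about 95% coverage)."""
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


low, high = wilson_interval(4, 34)
print(f"observed rate: {4 / 34:.1%}")
print(f"plausible range for the true rate: {low:.1%} to {high:.1%}")
print(f"implied imposters among 325 flagged: {low * 325:.0f} to {high * 325:.0f}")
```

the observed rate is about 12%, but with only 34 compounds checked the plausible range runs from roughly 5% to roughly 27%. the four confirmed cases are solid. any estimate for the full flagged set is loose.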
iv. why it matters
who cares if a few synthetic compounds sneak into a natural products database?
the answer is that databases are the foundation of computational biology. when researchers train machine learning models to predict “what makes a natural product bioactive,” they’re learning from whatever the database contains. if the database contains synthetic pharmaceutical compounds labeled as natural products, the model learns the wrong patterns. it starts thinking that chloro-nitro-hydrazones are what nature builds, and its predictions drift accordingly.
the same problem cascades into virtual screening, drug discovery pipelines, and ADMET (absorption, distribution, metabolism, excretion, toxicity) prediction. every downstream analysis inherits the assumptions of the data it was built on.