Let’s Play with Language
So I want to find an acronym that’s four or five characters long and contains the letters “D”, “F”, and “A”. I specifically want it to be an acronym. That is, I want a pronounceable string of characters where each character represents a word (like WHO or OSHA), as opposed to an initialism where each character is pronounced (like CDC or BBC).
The quickest way to do this would be to consider all actual words that meet my requirements. Because they’re actual words, they’re guaranteed to be pronounceable (think WHO). I used ack to search the list of words available at /usr/share/dict/words on Unix-like systems: [ $ cat /usr/share/dict/words | ack "d|D" | ack "f|F" | ack "a|A" | ack "^.{4,5}$" ] This produces a list of 36 words.
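The same filter can be sketched in Python. This is only an illustration of the pipeline above, not a replacement for it; the function name candidate_words and the sample word list are made up for the example, and in practice the input would be the lines of /usr/share/dict/words.

```python
import re

REQUIRED = set("dfa")

def candidate_words(words):
    """Keep words of 4 or 5 characters that contain d, f and a (any case)."""
    return [
        w for w in words
        if re.fullmatch(r".{4,5}", w) and REQUIRED <= set(w.lower())
    ]

# Hypothetical sample input; the post reads /usr/share/dict/words instead.
sample = ["fad", "fade", "Fraud", "draft", "daft", "drafty", "find"]
print(candidate_words(sample))
```

Lowercasing each word before the set comparison plays the same role as the "d|D" style alternations in the ack patterns.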
There are some decent contenders in that list, but I would also be willing to accept a string of characters that isn’t a word, so long as it’s pronounceable (think OSHA). To do this, I could write a script to generate all of the possible four- and five-character strings that contain “D”, “F”, and “A”, then go through and pick out the pronounceable ones. Of course, this would mean going through thousands of strings, and I really don’t want to do that. I could also generate the list, look for substrings that make a string unpronounceable, and modify my script to weed out strings that contain those substrings. But that still doesn’t seem to be an efficient way to go about this task.
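To get a feel for the size of that brute-force list, here is a rough Python sketch. It assumes lowercase a to z only; the names candidates and count_candidates are invented for this example. The closed-form count uses inclusion-exclusion over the three required letters, so the five-character case does not have to be enumerated just to be counted.

```python
from itertools import product
from string import ascii_lowercase

REQUIRED = set("dfa")

def candidates(length):
    """Yield every lowercase string of the given length containing d, f and a."""
    for letters in product(ascii_lowercase, repeat=length):
        if REQUIRED <= set(letters):
            yield "".join(letters)

def count_candidates(length):
    """Count the same strings by inclusion-exclusion instead of enumeration."""
    return 26**length - 3 * 25**length + 3 * 24**length - 23**length

four = list(candidates(4))
print(len(four), count_candidates(4), count_candidates(5))
```

There are 588 four-character strings and 36,030 five-character strings, which is indeed far too many to read through by hand.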
I happen to know that words tend to be made up of smaller structures of n characters called n-grams, most importantly bigrams and trigrams. At first I thought I could examine common bigrams containing the characters in which I am interested, and manually generate possible strings. For example, for the bigram “nd”, there is only one four-character string meeting my requirements: “fand” (“afnd”, “andf”, “fnda”, “ndfa”, and “ndaf” would be a stretch to call pronounceable). However, some strings leave a “wild card” position where all 26 letters have to be considered. This happens as soon as I include five-character strings, and also for any bigram made of two characters from my requirements list. In those cases the number of combinations became too much for me to manage manually.
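The “nd” example can be reproduced with a few lines of Python. This is a small sketch under the assumption that the bigram is treated as one unbreakable block; the function name arrangements is hypothetical.

```python
from itertools import permutations

def arrangements(bigram, others):
    """Arrange the other letters around the bigram, keeping the bigram intact."""
    blocks = [bigram, *others]
    return sorted({"".join(p) for p in permutations(blocks)})

print(arrangements("nd", "fa"))
# ['afnd', 'andf', 'fand', 'fnda', 'ndaf', 'ndfa']
```

Three blocks give 3! = 6 arrangements, which matches the six strings listed above. Adding a wild card position multiplies that by 26 and by the number of places the extra letter can go, which is where manual enumeration stops being practical.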
I decided the way to go about generating a good list of candidates would be to consider common bigrams for each character on my requirements list, examine them to determine “rules” for their use, and write a script that generates every string of length four or five containing at least one “D”, “F”, and “A”, and then eliminates the strings that break the rules. Whether or not this is ultimately the best way to go about this task, at this point I am interested in the idea of bigram usage rules, so this is how I’ll proceed.
Let’s Start with “F”
To start, I’ll get all the words from /usr/share/dict/words that contain an “F”, but that gets me over 21,000 results. I’m going to consider any bigram containing a vowel as valid, and won’t worry about them for now. Eliminating words where an “F” is bordered on both sides by a vowel or the edge of the word, I’m left with just over 10,000 results (I’m counting “Y” as a vowel for my purposes). Finally, let’s limit that to words that are seven characters or under. [ $ cat /usr/share/dict/words | ack "f|F" | ack -v "(a|A|e|E|i|I|o|O|u|U|y|Y|^)(f|F)(a|A|e|E|i|I|o|O|u|U|y|Y|$)" | ack "^.{1,7}$" > ~/Desktop/f.txt ] This gives us 2058 words, which is a workable number and should still be plenty to discern patterns.
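For comparison, here is a Python sketch of the same three-step filter. The names BORDERED_F and consonant_f_words and the sample list are assumptions made for the example. Like the ack -v step, it drops a word as soon as any one of its “F”s sits between vowels or word edges, even if another “F” in the same word touches a consonant.

```python
import re

# An "f" with a vowel or the word start on its left
# and a vowel or the word end on its right ("y" counts as a vowel).
BORDERED_F = re.compile(r"(?:[aeiouy]|^)f(?:[aeiouy]|$)", re.IGNORECASE)

def consonant_f_words(words, max_len=7):
    """Mirror the ack pipeline: has an f, no vowel-bordered f, short enough."""
    return [
        w for w in words
        if "f" in w.lower()
        and not BORDERED_F.search(w)
        and 1 <= len(w) <= max_len
    ]

# Hypothetical sample input; the post uses /usr/share/dict/words.
sample = ["flag", "soft", "offtake", "fifth", "after", "cat", "butterfly", "elf"]
print(consonant_f_words(sample))
```

Note that “fifth” is dropped because its first “f” is bordered by the word start and a vowel, even though it also contains “ft”. That is a side effect of the filter as written, and it means the counts that follow slightly understate how often each bigram occurs.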
Of those words, 532 of them (25.85 %) contain the bigram “fl”. Of those, 332 have “fl” at the beginning of the word. Those in turn break down into 100 “fla” words, 64 “flo” words, 60 “flu” words, 51 “fle” words, 35 “fli” words, and 22 “fly” words. From this we can discern our first rule:
When the bigram “fl” is at the beginning of a word, it precedes a vowel.
The remaining 200 words containing “fl” are similarly split into 83 “fle” words, 35 “fly” words, 32 “flo” words, 25 “fla” words, 15 “flu” words, and 10 “fli” words. Before we go any further, we can broaden our first rule by dropping the restriction to the beginning of a word:
The bigram “fl” always precedes a vowel.
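Tallies like these are tedious to do by hand. A possible Python sketch follows; the function name fl_followers is invented, and unlike the word counts above it counts every occurrence of “fl”, so a word containing “fl” twice would be counted twice.

```python
from collections import Counter

def fl_followers(words):
    """Count the character after each "fl", split by whether "fl" starts the word."""
    initial, later = Counter(), Counter()
    for word in words:
        w = word.lower()
        start = w.find("fl")
        while start != -1:
            # "$" marks an "fl" with nothing after it (end of word).
            nxt = w[start + 2 : start + 3] or "$"
            (initial if start == 0 else later)[nxt] += 1
            start = w.find("fl", start + 1)
    return initial, later

# Hypothetical sample input.
initial, later = fl_followers(["flag", "flat", "fly", "reflex", "gadfly", "flu"])
print(initial.most_common())
print(later.most_common())
```

If a consonant or the “$” marker ever showed up as a key in either counter, that word would be a counterexample to the rule.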
While sorting the words, I noticed another pattern: among the words where “fl” does not begin the word, every one containing “fly” has it at the very end. Therefore:
The trigram “fly” always begins or ends a word.
Also, none of the words that did not begin with “fl” ended with “fl”. Therefore:
The bigram “fl” never ends a word.
Let’s move on to another bigram. “ft” seems fairly common, with 206 occurrences. Unlike “fl”, which occurred mostly at the beginning of words, “ft” tends to appear at the end of words, and never at the beginning. So:
The bigram “ft” never begins a word.
There are very few words in which “ft” is not immediately preceded by a vowel. They are offtake, offtype, delft, and twelfth. So let’s say:
The bigram “ft” is always preceded by a vowel, the trigram “elf”, or the bigram “of”.
“ft” also has a tendency to precede a vowel, but there are too many exceptions (15) for me to consider it a rule.
At this point I’ve sunk several hours into this, and have barely gotten through two bigrams for one letter. My curiosity remains piqued with regard to the rules that make character strings pronounceable or not, but my approach needs reexamination. I’ll definitely be returning to this later. However, to review what I did uncover:
The bigram “fl” always precedes a vowel.
The bigram “fl” never ends a word.
The trigram “fly” always begins or ends a word.
The bigram “ft” never begins a word.
The bigram “ft” is always preceded by a vowel, the trigram “elf”, or the bigram “of”.
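As a rough idea of how these five rules could be applied to generated strings, here is a minimal Python sketch. The function name breaks_rules is hypothetical, and the handling of the last rule is one interpretation: “elf” is read as overlapping the “f” of “ft” (as in “delft”), while “of” is read as sitting directly before “ft” (as in “offtake”).

```python
VOWELS = set("aeiouy")  # "y" counts as a vowel, as in the post

def breaks_rules(s):
    """Return True if the string violates any of the five rules above."""
    s = s.lower()
    for i in range(len(s) - 1):
        pair = s[i : i + 2]
        nxt = s[i + 2 : i + 3]  # empty string when the pair ends the word
        if pair == "fl":
            # "fl" must precede a vowel, so it also cannot end the word.
            if nxt not in VOWELS:
                return True
            # "fly" may only begin or end the word.
            if nxt == "y" and i != 0 and i + 3 != len(s):
                return True
        if pair == "ft":
            # "ft" never begins a word.
            if i == 0:
                return True
            # "ft" follows a vowel, sits inside "elft", or follows "of".
            ok = (
                s[i - 1] in VOWELS
                or s[: i + 1].endswith("elf")
                or s[:i].endswith("of")
            )
            if not ok:
                return True
    return False

for s in ["daft", "flad", "dafl", "ftad", "aflyd"]:
    print(s, breaks_rules(s))
```

A string that passes is not necessarily pronounceable; it has only avoided the few patterns covered so far.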