User Input, Regex, Linguistics, and the Power of the Internet
I have been working on a new app and, as tends to happen, a routine requirement (parsing user input) quickly became a two-day conundrum.
I considered the parsing task complete over the weekend and rewarded myself with a swim in the ocean and a delicious IPA. But then...
All hail/Ah hell... User Input!
As a quick reference/aside, User Input is like the hair of my sister's high school boyfriend: unpredictable and sometimes unkempt (see: '80s Jon Bon Jovi).
Over the past few mornings (mornings start proportionately earlier as a task becomes a roadblock), I hit a wall trying to figure out how to simply get around a common task:
Task: Split user input by multiple common separators: commas, tabs, spaces, line breaks/returns, semicolons, colons (some people are into that)
Example: User enters names: Tom Tomato, Tim Tomillio, Todd Thoas
Solution: JavaScript's handy split() method
Output As:
• Tom Tomato
• Tim Tomillio
• Todd Thoas
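For the curious, here is a minimal sketch of that first pass. It assumes a plain space only counts as padding around the other separators (splitting on every space would also break "Tom Tomato" in two), and the function name `splitNames` is made up for illustration:

```javascript
// Hypothetical helper: split raw input on commas, semicolons, colons,
// tabs, and line breaks, then trim the padding around each entry.
function splitNames(input) {
  return input
    .split(/[,;:\t\r\n]+/)
    .map(function (entry) { return entry.trim(); })
    // Drop empty strings left by leading or trailing separators.
    .filter(function (entry) { return entry.length > 0; });
}

console.log(splitNames('Tom Tomato, Tim Tomillio, Todd Thoas'));
// [ 'Tom Tomato', 'Tim Tomillio', 'Todd Thoas' ]
```

The character class makes it cheap to add or drop a separator later, which turns out to matter.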
Easy enough.
"But wait," my wife seems to eagerly point out, "what if I want to enter a formal name that includes a shortened version or a nickname... like yours?" F. Touché, ladybird, touché.
Okay, so it shouldn't be that hard:
Updated Task: Modify the parser to account for common ways people type these out.
Example: Thomas (Tom, Tmoney), Timothy (Tim Timmy), Todd (Toodles)
Solution: ... many attempts ... many failures.
What's the issue?
Todd (Toodles), - None. Passes easily. The match finds the comma and separation is correct.
Timothy (Tim Timmy), - One. The space after "(Tim" is causing "Timmy)" to output separately. So we remove the option in our split of separating User Input by spaces, the match finds the comma and separation is correct. Sweet.
Thomas (Tom, Tmoney), - One and it becomes a bricklayer. The match ignores the space between "(Tom," and "Tmoney)," so that is a success. However, the match finds the comma between "(Tom" and "Tmoney)" and outputs "Thomas (Tom" and "Tmoney)" separately. Turns out, this is the jackleg.
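To make the three cases concrete, here is a rough reconstruction of the two naive attempts. The exact patterns I tried are not shown here, so treat these regexes and variable names as assumptions:

```javascript
var input = 'Thomas (Tom, Tmoney), Timothy (Tim Timmy), Todd (Toodles)';

// Attempt 1: whitespace counts as a separator, so every word comes apart.
var withSpaces = input.split(/[,\s;:]+/);
console.log(withSpaces);
// [ 'Thomas', '(Tom', 'Tmoney)', 'Timothy', '(Tim', 'Timmy)', 'Todd', '(Toodles)' ]

// Attempt 2: spaces are only padding now, but every comma still splits,
// including the one inside the parentheses.
var withoutSpaces = input.split(/\s*[,;:\t\r\n]+\s*/);
console.log(withoutSpaces);
// [ 'Thomas (Tom', 'Tmoney)', 'Timothy (Tim Timmy)', 'Todd (Toodles)' ]
```

Attempt 2 fixes Timothy and Todd, and leaves Thomas broken in exactly the way described above.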
So what is the issue?
Commas in development are cunning Linguists!
Commas are one of the most common separators (if not the most common) in the history of User Input. This is not based on facts, but more a logical assumption. So they are here to stay.
This led to a series of clever attempts, scientific regular expression theory, work-around hacks, brute force, and finally a resolution, thanks to the brilliant mind of this regex genius, who was kind enough to help someone in a similar situation.
Avinash's Regex Solution:
,(?![^()]*(?:\([^()]*\))?\))
His Regex Explanation:
// ,            ','
// (?!          look ahead to see if there is not:
//   [^()]*     any character except: '(', ')' (0 or more times)
//   (?:        group, but do not capture (optional):
//     \(       '('
//     [^()]*   any character except: '(', ')' (0 or more times)
//     \)       ')'
//   )?         end of grouping; the ? after the non-capturing group makes the whole group optional
//   \)         ')'
// )            end of look-ahead
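A quick way to see which commas the pattern actually picks is to swap each match for a visible marker. This is just a sketch on the example input from earlier; the variable names are mine:

```javascript
// The pattern from above: a comma that is NOT followed by text running
// into a closing ')', i.e. a comma that sits outside the parentheses.
var outsideParens = /,(?![^()]*(?:\([^()]*\))?\))/g;
var sample = 'Thomas (Tom, Tmoney), Timothy (Tim Timmy), Todd (Toodles)';

// Replace each matching comma with ' |' to show where a split would happen.
var marked = sample.replace(outsideParens, ' |');
console.log(marked);
// Thomas (Tom, Tmoney) | Timothy (Tim Timmy) | Todd (Toodles)
```

The comma between "Tom" and "Tmoney" survives because the look-ahead finds a ")" ahead of it before any "(".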
His Example on Regex101
His Answer on Stack Overflow
The implementation that works for my particular brickhouse:
var cleanUserInput = str.replace(/,(?![^()]*(?:\([^()]*\))?\))/gm, '\r');
return cleanUserInput.split(/[\r\n;]+/);
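Those two statements live inside a function in my app. Here is a self-contained sketch of what that could look like. The name `parseNames` is hypothetical, and the trim and empty-entry filter are additions I am assuming are wanted, since the raw split leaves a leading space on every entry after the first:

```javascript
// Hypothetical wrapper around the replace-then-split approach.
function parseNames(str) {
  // Turn every comma outside parentheses into a carriage return...
  var cleanUserInput = str.replace(/,(?![^()]*(?:\([^()]*\))?\))/gm, '\r');
  // ...then split on carriage returns, newlines, and semicolons.
  return cleanUserInput
    .split(/[\r\n;]+/)
    // Assumption: surrounding whitespace and empty entries are unwanted.
    .map(function (name) { return name.trim(); })
    .filter(function (name) { return name.length > 0; });
}

console.log(parseNames('Thomas (Tom, Tmoney), Timothy (Tim Timmy), Todd (Toodles)'));
// [ 'Thomas (Tom, Tmoney)', 'Timothy (Tim Timmy)', 'Todd (Toodles)' ]
```

Replacing first and splitting second keeps the tricky look-ahead in one place and leaves the split pattern simple.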