Sanitizing user data: How we do it for Facebook and Twitter accounts
I’m Matt Fitzgerald, a Software Engineer at OpenSky. OpenSky is a portal, and as a portal we ingest a lot of content. As a result, we face a lot of problems with inconsistent user data, and I’m often tasked with cleaning up these issues after they pop up.
Recently, we discovered that our business entities (Sellers and Suppliers) had a lot of junk data in their OpenSky profiles about their Facebook and Twitter accounts. We include links on the OpenSky platform that direct Customers to the Seller’s / Supplier’s profile page on Facebook and Twitter. Upon investigation, we found our Sellers / Suppliers were often confused when we asked them to save these identities into their OpenSky profile. As a result, a lot of Sellers / Suppliers were entering incomplete or inconsistent data, which made these links confusing destinations for the Customer.
We went back and reexamined several parts of our application to see where the confusion began. The first pain point we identified was in our on-boarding process for Sellers / Suppliers. When Sellers / Suppliers entered their Facebook and Twitter accounts, we simply required the input to be a valid URL that contained either “facebook.com” or “twitter.com”. Unfortunately for OpenSky, many of our users don’t really understand what a valid URL even means, and by requiring these valid pieces of data we prevented a high percentage of users from successfully entering our community.
During testing, we found that users would enter a URL like www.facebook.com/The.OpenSky.Project and get rejected because only URLs with a protocol (http://www.facebook.com/The.OpenSky.Project) are valid. In practice, the former entry contained enough information for our application to correct the mistake. Instead, we simply stopped the user and showed error messages telling them to fix it. At first glance, making users enter valid data seemed reasonable. On reflection, we realized we could sanitize a user’s data and keep them moving through our application with fewer impediments.
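As a minimal illustration of that kind of correction, here is a sketch of a helper that adds a missing protocol instead of rejecting the input. The function name ensureScheme and the default of http are assumptions made for this example, not part of our code base, and the type hints assume PHP 7 or later.

```php
<?php
// Hypothetical helper, shown only to illustrate "correct instead of reject".
function ensureScheme(string $input, string $defaultScheme = 'http'): string
{
    $input = trim($input);
    if ($input === '') {
        return $input;
    }

    // Leave the value alone if it already starts with http:// or https://.
    if (preg_match('#^https?://#i', $input)) {
        return $input;
    }

    // Otherwise prepend the default scheme so the value becomes a full URL.
    return $defaultScheme . '://' . $input;
}

echo ensureScheme('www.facebook.com/The.OpenSky.Project'), "\n";
// prints http://www.facebook.com/The.OpenSky.Project
```

A helper like this only fixes the missing protocol. It does nothing for users who enter @openskyproject or a gallery link.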
To make matters worse, other users understood their Facebook and Twitter Identities in terms of a variety of vanity URLs, profile IDs, usernames, and SMS @reply tags. For instance, we asked people to enter a URL for their Twitter Identity like http://twitter.com/openskyproject, and many only understood their Twitter Identity to be something like @openskyproject. We even had problems with users entering links to their Facebook Photo Galleries rather than their main profile page.
We decided to remove these roadblocks by writing sanitizers that “smartly” extract the relevant identity from all of the ways an identity can be known and save only the relevant portion to our database.
Valid Facebook and Twitter Identities
Facebook Vanity URLs: http://www.facebook.com/The.OpenSky.Project
Facebook Profile IDs: http://www.facebook.com/profile.php?id=198143405897
Facebook Photo Gallery: http://www.facebook.com/profile.php?id=198143405897#!/The.OpenSky.Project?v=photos&ref=mf
Twitter SMS @reply: (@openskyproject)
Twitter User Name URL: http://twitter.com/openskyproject
Overview
The following solution involves PHP, Regular Expressions, and Symfony 2.0.
Sanitizing the Data
The first thing we had to do was write custom Validators in Symfony 2 that define what a valid Facebook or Twitter Identity looks like. Since there were many ways to build these identities, we had to search the data for the relevant portion and extract it. The second thing we did was pass these patterns to custom Form Fields. These form fields would first remove the Facebook and Twitter domains, because these were actually not relevant to the validation process (the exact opposite of what we had first thought). So before the data was even validated, the value of the input field would be pre-formatted.
    class FacebookIdentity extends \Symfony\Components\Validator\Constraints\Regex
    {
        // Regex fragments, one per form a Facebook identity can take.
        const GROUP    = 'group\.php\?gid=\d+';
        const PAGE     = 'pages\/[a-zA-Z0-9_\-]+\/\d+';
        const PROFILE  = 'profile\.php\?id=\d+';
        const USERNAME = '[a-zA-Z0-9\.]{1,50}';

        // ... some other Symfony 2 requirements here ...

        public function __construct($options = null)
        {
            parent::__construct($options);

            // The sanitized value must match one of the fragments in full.
            $this->pattern = sprintf('/^(%s|%s|%s|%s)$/', self::GROUP, self::PAGE, self::PROFILE, self::USERNAME);
        }

        // ... some other Symfony 2 requirements down here ...
    }

    class FacebookIdentityField extends \Symfony\Components\Form\TextField
    {
        // Matches everything up to and including the domain.
        const STRIP = '/.*facebook\.com/';

        protected function processData($data)
        {
            if (empty($data)) {
                return $data;
            }

            $strippedData = preg_replace(self::STRIP, '', $data);

            // Try the most specific fragments first; the first match wins.
            $fragments = array(
                FacebookIdentity::GROUP,
                FacebookIdentity::PAGE,
                FacebookIdentity::PROFILE,
                FacebookIdentity::USERNAME,
            );
            foreach ($fragments as $fragment) {
                if (preg_match(sprintf('/%s/', $fragment), $strippedData, $matches)) {
                    return $matches[0];
                }
            }

            // Nothing recognizable: hand the raw value to the validator.
            return $data;
        }
    }
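To see what the strip-then-match steps produce without the Symfony classes, here is a standalone sketch run against the Facebook examples listed earlier. The function name extractFacebookIdentity is hypothetical, the pattern list is copied from the constants above, and the type hints assume PHP 7 or later.

```php
<?php
// Standalone sketch of the field's logic; not the Symfony form field itself.
function extractFacebookIdentity(string $input): string
{
    // Same fragments as the FacebookIdentity constants, most specific first.
    $fragments = [
        'group\.php\?gid=\d+',
        'pages\/[a-zA-Z0-9_\-]+\/\d+',
        'profile\.php\?id=\d+',
        '[a-zA-Z0-9\.]{1,50}',
    ];

    // Drop everything up to and including the domain.
    $stripped = preg_replace('/.*facebook\.com/', '', $input);

    foreach ($fragments as $fragment) {
        if (preg_match('/' . $fragment . '/', $stripped, $matches)) {
            return $matches[0];
        }
    }

    // No fragment matched: return the input unchanged.
    return $input;
}

$examples = [
    'http://www.facebook.com/The.OpenSky.Project',
    'http://www.facebook.com/profile.php?id=198143405897',
    'http://www.facebook.com/profile.php?id=198143405897#!/The.OpenSky.Project?v=photos&ref=mf',
];
foreach ($examples as $example) {
    echo extractFacebookIdentity($example), "\n";
}
// prints:
// The.OpenSky.Project
// profile.php?id=198143405897
// profile.php?id=198143405897
```

Because the profile fragment is tried before the username fragment, the photo gallery link resolves to the profile ID rather than to the text after the hash.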
    class TwitterIdentity extends \Symfony\Components\Validator\Constraints\Regex
    {
        // Twitter usernames are at most 15 characters long.
        const USERNAME = '[a-zA-Z0-9_\-]{1,15}';

        public function __construct($options = null)
        {
            parent::__construct($options);

            // The sanitized value must be a bare username and nothing else.
            $this->pattern = sprintf('/^%s$/', self::USERNAME);
        }

        // ... some other Symfony 2 requirements down here ...
    }

    class TwitterIdentityField extends \Symfony\Components\Form\TextField
    {
        // Matches everything up to and including the domain.
        const STRIP = '/.*twitter\.com/';

        protected function processData($data)
        {
            if (empty($data)) {
                return $data;
            }

            $strippedData = preg_replace(self::STRIP, '', $data);

            // The first run of allowed characters is the username, so a
            // leading "/" or "@" is skipped.
            if (preg_match(sprintf('/%s/', TwitterIdentity::USERNAME), $strippedData, $matches)) {
                return $matches[0];
            }

            // Nothing recognizable: hand the raw value to the validator.
            return $data;
        }
    }
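The same idea for Twitter, as a standalone sketch that also rebuilds a profile link from the stored identity. Both function names and the http://twitter.com/ base URL are assumptions for this example, the pattern assumes usernames of up to 15 characters, and the type hints assume PHP 7 or later.

```php
<?php
// Standalone sketch of the field's logic; not the Symfony form field itself.
function extractTwitterIdentity(string $input): string
{
    // Remove the domain if present; "@openskyproject" passes through untouched.
    $stripped = preg_replace('/.*twitter\.com/', '', $input);

    // The first run of allowed characters is the username, so "@" and "/" are skipped.
    if (preg_match('/[a-zA-Z0-9_\-]{1,15}/', $stripped, $matches)) {
        return $matches[0];
    }

    // No username found: return the input unchanged.
    return $input;
}

// Hypothetical helper: turn the stored identity back into a link.
function buildTwitterUrl(string $identity): string
{
    return 'http://twitter.com/' . $identity;
}

echo extractTwitterIdentity('@openskyproject'), "\n";
// prints openskyproject
echo buildTwitterUrl(extractTwitterIdentity('http://twitter.com/openskyproject')), "\n";
// prints http://twitter.com/openskyproject
```

Storing only the username means either form of input ends up as the same link for the Customer.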
by Matthew Fitzgerald














