Never trust any input. And by any input, I mean, ANY input!
I just spent an hour or two metaphorically ripping my hair out because I could not for the life of me figure out why some data was not being saved correctly into our database.
We have a usage-tracking tool that pulls data from an Amazon Web Services ELB traffic log, collates it, and then saves it to a database. (Why aren't we using a log analyzer like Kibana, you ask? Because we need to link the data to related records in our main database, and because the transformations we need are complicated enough that it made more sense to roll our own tool.)
Anyway, this is what a line from the ELB log might look like (taken from the "Access Logs for Your Classic Load Balancer" page of AWS documentation):
2015-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 10.0.0.1:80 0.000086 0.001048 0.001337 200 200 0 57 "GET /product/1234/ HTTP/1.1" "curl/7.38.0" DHE-RSA-AES128-SHA TLSv1.2
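Since each entry is a space-delimited line with quoted fields, Python's csv module can split it. Here is a minimal sketch of that step, not the actual code from our tool. The field names in `ELB_FIELDS` and the `read_elb_rows` helper are labels I'm assuming for illustration; they follow the order of the fields in the sample line.

```python
import csv
import io

# Assumed labels for the 15 space-separated fields of a Classic ELB log line.
ELB_FIELDS = [
    'timestamp', 'elb', 'client', 'backend',
    'request_processing_time', 'backend_processing_time',
    'response_processing_time', 'elb_status_code', 'backend_status_code',
    'received_bytes', 'sent_bytes', 'request', 'user_agent',
    'ssl_cipher', 'ssl_protocol',
]

SAMPLE_LINE = (
    '2015-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 '
    '10.0.0.1:80 0.000086 0.001048 0.001337 200 200 0 57 '
    '"GET /product/1234/ HTTP/1.1" "curl/7.38.0" DHE-RSA-AES128-SHA TLSv1.2'
)


def read_elb_rows(lines):
    """Yield one dict per log line; double-quoted fields stay intact."""
    return csv.DictReader(lines, fieldnames=ELB_FIELDS, delimiter=' ')


rows = list(read_elb_rows(io.StringIO(SAMPLE_LINE)))
print(rows[0]['request'])     # GET /product/1234/ HTTP/1.1
print(rows[0]['user_agent'])  # curl/7.38.0
```

The default quote character of the csv module is the double quote, which is why the request and the user agent each come back as a single field even though the request contains spaces.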
I would extract the URL from each line entry, then compare it with expected regex patterns and create the correct entry in the database. Something like this:
PATH_PATTERNS = {
    r'^/paper/(?P<paper_id>\d+)$': 'paper',
    r'^/journal/(?P<journal_code>[-\w]+)$': 'journal',
}
A URL such as /paper/1234 would match the first pattern and create an entry in the log database pointing to the paper record with the id 1234.
I was using Python's [csv.DictReader](https://docs.python.org/3.5/library/csv.html) to break each line apart, which let me grab the request field. It looked like this:
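To make the matching step concrete, here is a minimal sketch. The `classify_url` helper is hypothetical (it is not from our actual tool), and the patterns are written as a dict so that each regex maps to a record type.

```python
import re

# Hypothetical mapping of URL regexes to record types, mirroring the patterns above.
PATH_PATTERNS = {
    r'^/paper/(?P<paper_id>\d+)$': 'paper',
    r'^/journal/(?P<journal_code>[-\w]+)$': 'journal',
}


def classify_url(url):
    """Return (record_type, captured_groups) for the first matching pattern, else None."""
    for pattern, record_type in PATH_PATTERNS.items():
        match = re.match(pattern, url)
        if match:
            return record_type, match.groupdict()
    return None


print(classify_url('/paper/1234'))          # ('paper', {'paper_id': '1234'})
print(classify_url('/journal/my-journal'))  # ('journal', {'journal_code': 'my-journal'})
print(classify_url('/product/1234/'))       # None
```

Anything that fits no pattern comes back as `None`, so a URL with stray text attached drops out without any error being raised.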
"GET /paper/1234 HTTP/1.1"
Here's the code I used to determine the requested URL:
request_method = request.split()[0]
url = ' '.join(request.split()[1:]).rstrip('HTTP/1.1')
Now, looking back I can definitely see better ways to write this code anyway. Assuming the format of METHOD{space}URL{space}HTTP/1.1, this is much cleaner code:
request_method, url, _ = request.split()
But for the sake of argument, let's assume that when I wrote this, the way I parsed it was correct — that the first token should be the request method, and the remainder of the string should be the URL, and we should strip away the HTTP/1.1 at the end.
Does anyone see the problem?
No?
The cardinal rule of programming:
You can never trust customer input
In this case, although Amazon is scrupulous in how it outputs its log data, it is still just reporting the content it gets from another source — in this case, the request made by a user agent to our server that created the log entry in the first place.
Turns out that even though HTTP 1.1 is 20 years old, there are still some user agents out there that use HTTP 1.0!
Here is the request that was confounding my app:
"GET /paper/1234 HTTP/1.0"
After processing, the URL, which should have been /paper/1234, was instead /paper/1234 HTTP/1.0, which of course did not match any of the expected patterns. The call to .rstrip('HTTP/1.1') only removes trailing characters that appear in that string, and the final 0 is not one of them, so nothing was stripped. As a result, none of the records for this user agent were being recognized or saved to the database!
A simple fix, fortunately (again, assuming I'm not switching to the simpler, cleaner code):
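The behavior is easy to reproduce in isolation. This short sketch uses only the two request strings shown above; the variable names are mine.

```python
# str.rstrip(chars) removes any trailing characters found in `chars`;
# it does not remove `chars` as a whole suffix.
http11 = '/paper/1234 HTTP/1.1'.rstrip('HTTP/1.1')
http10 = '/paper/1234 HTTP/1.0'.rstrip('HTTP/1.1')

print(repr(http11))  # '/paper/1234 '  (stripping stops at the space)
print(repr(http10))  # '/paper/1234 HTTP/1.0'  (the trailing '0' is not in the set)
```

Note that even in the HTTP/1.1 case a trailing space is left behind, so the result would still need a `.strip()` before being matched against patterns anchored with `$`.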
import re
...
request_method = request.split()[0]
url = ' '.join(request.split()[1:])
# Remove the trailing protocol token, whatever its version, plus the space before it
url = re.sub(r'\s*HTTP/[.\d]+$', '', url)
And here, I could also write it as
request_method = request.split()[0]
url = ' '.join(request.split()[1:-1])
but I wanted to be explicit about exactly what I was excluding here. Writing out the re.sub makes it clearer that I'm taking out the protocol information.
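Taking the lesson one step further, one option is to check the whole request field against the shape I expect and fail loudly when it does not fit, rather than letting a mangled URL quietly miss every pattern. This is only a sketch under my own assumptions: `REQUEST_RE` and `parse_request` are hypothetical names, and the regex assumes the field is METHOD, a space, the URL, a space, then HTTP/ followed by a version number.

```python
import re

# Assumed shape of a request field: METHOD, space, URL, space, HTTP/<version>.
REQUEST_RE = re.compile(
    r'^(?P<method>\S+) (?P<url>.*) (?P<protocol>HTTP/\d+(?:\.\d+)?)$'
)


def parse_request(request):
    """Split a request field into (method, url, protocol).

    Raises ValueError when the field does not fit the assumed shape,
    so unexpected input is noticed instead of silently dropped.
    """
    match = REQUEST_RE.match(request)
    if match is None:
        raise ValueError('unexpected request format: %r' % request)
    return match.group('method'), match.group('url'), match.group('protocol')


print(parse_request('GET /paper/1234 HTTP/1.1'))  # ('GET', '/paper/1234', 'HTTP/1.1')
print(parse_request('GET /paper/1234 HTTP/1.0'))  # ('GET', '/paper/1234', 'HTTP/1.0')
```

Whether to raise, log, or count the failures is a design choice. The point is that a request I did not anticipate shows up somewhere I will see it.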
Anyway, so that's it! Not only can you not trust what the data is, you can't even trust how they send it to you.

















