Messy data - time for a scrub and a clean
(Image credit: EU Social)
To answer the question whether or not there is price variability for the same prescription across the NHS, I thought that a visual graph would be the best way to quickly assess the situation.
Histogram is usually the graph of choice for plotting variation, but with almost 20,000 different drug codes to analyse, seeing so many individual histograms will be eye-wateringly painful and difficult to digest.
So I am thinking of making some box plots. It still won't be neat and tidy, but I think it's a good starting point to judge just how variable the prices are, especially if I plot them against the same Y-axis. This will have the added advantage of seeing which boxes are "taller" (eg. more variability) or "higher" up the Y-axis (eg. more costly items) to help me decide which prescriptions I might want to delve into at a later time.
As my dataset contains information on the cost and associated quantity, I can work out the "per item" cost simply by dividing the former by the latter. This way, when I plot the cost of one particular drug, the variability isn't due to the number of items prescribed that month, but rather gives a measure that can be compared across different primary care trusts.
https://gist.github.com/4530302
In the first block of code, I created a new column "costperitem" in my raw data by doing the aforementioned division. Then I used the summary function to quickly test the min and mix boundary that I could use for my boxplot.
But then I spotted a problem: the mean and the max of costperunit is registering as "Inf," which means that in the raw data, there are instances where at least one trust has denoted zero as the quantity for a particular prescription.
That, ladies and gents, has got to be an input error. If something wasn't prescribed, it shouldn't t be on the list at all. To do a quick test on how many lines of data have a zero-quantity, I used the subset function:
https://gist.github.com/4530675
I created a variable "test" that replicates the structure of august_raw data.frame, but to include data where QUANTITY column is equals (==) to 0. Then I used the nrow() command to work out there are 19 such cases. Line 5 also does exactly the same thing without creating a new data.frame, but I find that it's best to create a new table so you can look more into the subset of data.
https://gist.github.com/4530747
Perhaps the easiest thing to do is substitute all 0's with 1's, on the assumption that quantity of at least one must have been prescribed. But then I noticed that the "items" column show different numbers other than 1's. Now, the "quantity" column should reflect the changes in "items", such that as the number of items goes up, quantity should go up proportionally. For example, think of a standard packet of ibuprofen, which might come as a box of 10 pills. "Items" refer to the box, so if 2 boxes were prescribed, then "quantity" should say 20.
After spotting these 19 rows of data showing nil quantity, it makes me question how many of the other 10,056,613 rows of data have inconsistencies... And how many future assumptions I will have to make when further analysing the data.
For the sake of this soldiering on, I will replace quantity by the number of items for these 19 instances, which I will cover in my next post.