Cecilia's Musings @cecilialiao - Tumblr Blog

Road Test Python 3 with Anaconda

Two lines of code to get started on the Python 3 journey. This short tutorial will get you road testing Python 3 using virtual environment provided by Anaconda on a Windows machine in less than 10 minutes. (more…)

View On WordPress

#py2to3 #python #python3

•18+ Adults Only

Watch Anya Live on Cam

Anya is live and ready to show you everything. Watch her strip, dance, and perform exclusive shows just for you. Interact in real-time and make your fantasies come true.

✓ Live Streaming✓ Interactive Chat✓ Private Shows✓ HD Quality✓ Free Actions

Free to watch • No registration required • HD streaming

The business case for open data from businesses

Open data seems to be all the rage these days. National governments are some of the major advocates that data should be made freely available for use and reuse in order to realise economic value to the tune of US$3 trillion. There are also suggestions…

View Post

#data #open data

Knowledge worker and post-holiday stress disorder

I’ve just returned from a week in Sharm el Sheikh. November is a great time to escape the cold creeping into the South West England by basking under the Egyptian sun. Of course, that is, until the weeklong bliss in paradise comes crashing down as one…

View Post

#knowledge worker #post-holiday stress disorder

Analysing (big) data on a (small) laptop

(Image credit: John Flinchbaugh)

So, I've got my hands on some big, open data by way of NHS prescription data for August 2012 and I've been trying to answer the question what, if any, price variations are there for drugs and appliances prescribed by GP practices across the UK.

Well, as it turns out, analysing (big) data on a (small) laptop is harder than it looks. I've been doing some analyses on a 32-bit Windows 7 laptop from work (I'm sure proper data wranglers are either cringing or think I am mad) with 4GB of RAM, but about 1GB is taken up by security software and essential background services. So by the time I have loaded that CSV file into a data.frame in R, there isn't much wiggle room to do the analysis in the traditional way, which was why in my last post I hit the message:

Error: cannot allocate vector of size 76.7 Mb

There are a few ways of resolving this issue. First, work smarter. In my few weeks of experimentation with R, I have found that there are usually more than one way of achieving the same result, and some ways are less memory intensive than others.

Second, you could use libraries such as ff or bigmemory that utilises chunking to overcome the memory issue, but this isn't necessarily compatible with the codes I have already written and so I would have to learn a whole new set of languages and rewrite some of the algorithms. Frankly, as a beginner, I can only take in so much learning in one go!

Another option is to go big - get more RAM. Although Windows 8 64-bit supports memory up to 512GB, the Intel chips for off-the-shelf computers for consumers (with either Core i's or Xeon chips) only support up to 32GB memory. And as you go up the RAM size, the price becomes prohibitively expensive (as far as experiment is concerned).

Finally, I could turn to the cloud. Amazon offers a freemium service so I could get myself set up with a Linux instance for no fee, set up R there and run a few queries. But again, there is yet another learning hurdle - how to set up the instance, how to use Linux, etc.

From all this, I learned that there is no getting away from technology when it comes to analysing big data set, as machines and software built for home (or small/medium enterprise) use are generally underpowered for everyday people who might want to have a crack at big data sets. While citizen and government interest in open data is growing, there are many hurdles when it comes to converting open data into useful information and insight that "little guys" can't afford to do.

As for my own little experiment, I have opted for option #3 - I now have access to a virtual server through work, essentially a sandbox environment with 12GB of RAM. Yippee! Now my next challenge is actually setting up a working database so that I can more effectively analyse data through time.

As it's now turning into a work project, I don't think I can blog much more about the fun and crazy adventure. But stay tuned, I am sure I will think of something equally amazing to fill up this space!

#data analysis #data analytics #NHS prescription data #[R]#open data

Cleaning my open data

(Image credit: clogwog)

Following my previous post, here's what I've done to try and tidy up the data and amend rows of data where "quantity" was recorded as 0.

To recap the issue, "Quantity" recorded as "0" is most likely an entry error, because another column, titled "Items", also gives reference to the number of items prescribed. There should be a direct relationship between items and quantity, such that quantity cannot be zero.

Therefore, for rows where "Quantity" has been marked as "0", I am looking to replace this with the corresponding number from "Items" as a best guess estimate.

I tried this out on my "test" variable, which is a subset of the raw data where quantity has been marked as "0", with the command:

> test$QUANTITY[test$QUANTITY==0] <- test$ITEMS[test$QUANTITY==0]

Worked like a treat. Time to do this on the large dataset...

Ever the cautious one, I don't want to actually replace anything in the "Quantity" column, in case my formula doesn't work and I ruin the raw data. So I thought I will replicate a column, called QUANTITY_ADJ, which I will manipulate instead:

> august_raw$QUANTITY_ADJ <- august_raw$QUANTITY

But....

Error: cannot allocate vector of size 76.7 Mb

Argh!

#data analysis #data analytics #NHS prescription data #[R]#open data

•18+ Adults Only

Watch Anya Live on Cam

Anya is live and ready to show you everything. Watch her strip, dance, and perform exclusive shows just for you. Interact in real-time and make your fantasies come true.

✓ Live Streaming✓ Interactive Chat✓ Private Shows✓ HD Quality✓ Free Actions

Free to watch • No registration required • HD streaming

Messy data - time for a scrub and a clean

(Image credit: EU Social)

To answer the question whether or not there is price variability for the same prescription across the NHS, I thought that a visual graph would be the best way to quickly assess the situation.

Histogram is usually the graph of choice for plotting variation, but with almost 20,000 different drug codes to analyse, seeing so many individual histograms will be eye-wateringly painful and difficult to digest.

So I am thinking of making some box plots. It still won't be neat and tidy, but I think it's a good starting point to judge just how variable the prices are, especially if I plot them against the same Y-axis. This will have the added advantage of seeing which boxes are "taller" (eg. more variability) or "higher" up the Y-axis (eg. more costly items) to help me decide which prescriptions I might want to delve into at a later time.

As my dataset contains information on the cost and associated quantity, I can work out the "per item" cost simply by dividing the former by the latter. This way, when I plot the cost of one particular drug, the variability isn't due to the number of items prescribed that month, but rather gives a measure that can be compared across different primary care trusts.

https://gist.github.com/4530302

In the first block of code, I created a new column "costperitem" in my raw data by doing the aforementioned division. Then I used the summary function to quickly test the min and mix boundary that I could use for my boxplot.

But then I spotted a problem: the mean and the max of costperunit is registering as "Inf," which means that in the raw data, there are instances where at least one trust has denoted zero as the quantity for a particular prescription.

That, ladies and gents, has got to be an input error. If something wasn't prescribed, it shouldn't t be on the list at all. To do a quick test on how many lines of data have a zero-quantity, I used the subset function:

https://gist.github.com/4530675

I created a variable "test" that replicates the structure of august_raw data.frame, but to include data where QUANTITY column is equals (==) to 0. Then I used the nrow() command to work out there are 19 such cases. Line 5 also does exactly the same thing without creating a new data.frame, but I find that it's best to create a new table so you can look more into the subset of data.

https://gist.github.com/4530747

Perhaps the easiest thing to do is substitute all 0's with 1's, on the assumption that quantity of at least one must have been prescribed. But then I noticed that the "items" column show different numbers other than 1's. Now, the "quantity" column should reflect the changes in "items", such that as the number of items goes up, quantity should go up proportionally. For example, think of a standard packet of ibuprofen, which might come as a box of 10 pills. "Items" refer to the box, so if 2 boxes were prescribed, then "quantity" should say 20.

After spotting these 19 rows of data showing nil quantity, it makes me question how many of the other 10,056,613 rows of data have inconsistencies... And how many future assumptions I will have to make when further analysing the data.

For the sake of this soldiering on, I will replace quantity by the number of items for these 19 instances, which I will cover in my next post.

#data analytics #NHS prescription #data #data analysis #[R]#open data

Beyond number crunching - big data analysis is about getting intimate with your data

(Image credit: zhouxuan12345678)

As my last post alluded, the data is the answer, much like 42 is the answer in the Hitchhiker's Guide to the Galaxy. The first step to analysing big data is asking the right questions, and then breaking the task at hand with smaller questions to work towards the ultimate answer using the tools available.

One of the first questions I want to answer, with the NHS prescription data, is how much price variations there are for each drug or appliance prescribed by the different GP practices across England.

The data set contains the following columns:

Strategic health authority (SHA) code (3 characters)

Primary care trust (PCT) code (3 characters)

Practice code (6 characters)

British National Formulary (BNF) code (15 characters)

BNF name (truncated to 40 characters)

Total Items

Total Net Ingredient

Breaking my question into smaller chunk, I need to work out how many drugs and appliances I am working with. This can be done by looking at the unique values in either column for BNF code (4) or BNF name (5).

In R, the unique function makes this a simple task:

https://gist.github.com/4452391

"august_raw" is the original prescription data for the month of august, containing 10m+ rows of data and the 7 columns detailed above. In the first line of codes, I created a new variable (factor, technically) called "unique_bnfname" to store the list of unique values found in column "BNF.Name" in data.frame "august_raw".

I carried out a similar procedure to find unique BNF codes, then used the length function to find out how many items are in the two factors. To my surprise, the lengths are different! There are 19,741 unique BNF codes in the dataset, but only 17,626 unique items when looking at BNF names - why?

Time to turn to the internet and learn about BNF codes and BNF names. Thankfully, the NHS has created a glossary of terms that answers this question:

The BSA prescribing and dispensing information systems use the therapeutic classifications defined in the British National Formulary (BNF) as extended by the BSA, The same drug may appear in different areas of the BNF as it can be prescribed to treat more than one condition. For example: aspirin is licensed for use as an antiplatelet and as an analgesic.

Ah ha, so some items have multiple codes because they are prescribed for different reasons. This newfound knowledge poses more questions - do identical items prescribed for different ailments have significantly different costs?

Big data analysis is more than just crunching numbers. It's about getting intimate with your data - learning about its nature and quirks - and asking more questions as you get to know it better.

In my next post, I'll update you on my progress on answering this new question.

#data analytics #NHS prescription data #data analysis #r #[r]#open data

The answer is in the data, but what questions do we ask of the data?

(Image credit: Oberazzi)

Fellow tumblr user modestgesture posed some great questions on my last post:

Are you looking for trends or differences? do you plan on investigating 2, 3, or more groups? multiple subsets within a bigger group, or just 2 columns against eachother? without looking at the data these are my initial questions…

I'll start by tackling the easiest question - what is in the dataset? The NHS prescription data includes, on a monthly basis, prescriptions by medical practitioners in England and dispensed in the UK. Each row of data details a specific item that was prescribed by an NHS practice (such as paracetamol or wound dressings), the number of items prescribed, the net ingredient cost, and the actual cost of the prescription, calculated by:

Actual Cost = (Net Ingredient Cost less discount) + Payment for Consumables + Payment for Containers + Out of Pocket Expenses

Given the nature of the data, one of the most obvious way of dissecting the data is to spot the differences. For a specific item prescribed, is there variability in the actual costs paid by different NHS practices? If so, are there practices or regions that are consistently overpaying? In addition to answering whatare the cost differences, can the data shed any insight as to why there are cost differences and how the differences could be eliminated?

Sifting the data through time would also be an interesting exercise as well as spotting long-term trends. As this would require multiple CSV files, however I will have to earmark this particular exercise for when I become more proficient with R.

In my next post I will let you know I get on with answering some of the above questions. In the meantime, keep firing away any questions you may have for me and/or the data!

#R #[R]#data analysis #data analytics #open data #NHS prescription data

Diving into big data - how hard can it be?

While I am great at regurgitating textbook material in exams, information doesn't stick with me unless I am using the knowledge, particularly if it's knowledge from seeking for a specific answer to fit the puzzle.

A bit cryptic, perhaps, but I am so bored of reading Chapters 1-3 of introductory textbooks on R (or Python, HTML5, or other books on programming), printing "Hello World" commands, doing mathematics while minding the orders of operations, and trying to remember the differences between lists, vectors, matrices, and data.frames. And of course, everything appears so obvious until I try to recall which commands I've read in the last three days would actually do the one simple task at hand, only to conclude I have wasted the last three days gaining nothing.

Sometimes, going straight into the deep end and learning from experience is the best way to go. That's why for centuries apprenticeships work.

Without getting sidetracked into a rant about the issues of first world education (!), I decided it's time to just grab a big dataset and play with it. As Jeremy Clarkson puts it, how hard can it be?

So here goes.

UK's National Health Service has been releasing prescription data by GP practices as open data, and I have downloaded the August 2012 data at the presentation level, the most recent data available at the time of writing. The CSV file is 1.3GB and I will be using R Studio (which runs on R) to analyse the data.

This is probably "small data" compared to what other people or companies may be analysing, but it's pretty big to me - the file does not open in Microsoft Excel as it has too many rows of data!

Just how many? Well, first thing to do is to load the file in R:

https://gist.github.com/4397916

This command reads the CSV into a data.frame which I have labelled "august_raw". And then all I have to do is ask R:

https://gist.github.com/4397951

My CSV file contains over 10 million rows. No wonder Excel struggles! Other interesting information about my "table" is that it's a data.frame, took over 230 seconds to load into the system, and has 11 columns of data:

https://gist.github.com/4398078

There is also a really handy command on R that gives a preview of what the data looks like:

https://gist.github.com/4398096

Okay, so not the most sophisticated analysis on big data. So here's where I need your help - are you experienced in R (and/or stats) and willing to be my teacher? After all, every apprentice needs a master craftsman. Even if you're not well versed in R, can you think of questions that this dataset (or other big and open data) could answer? Join me in my learning experiment!

#R #[R]#data analysis #data analytics #open data #NHS prescription data

I'm back!

(Photo credit: HikingArtist.com)

Hello Tumblr and Tumblr-ians!

After a longer than expected sabbatical from Tumblr, I am making a comeback. Not one weekend goes by in the past nine months when I didn't think I need to come back to my blog. But with a new job and several personal duties occupying my attention throughout the summer months - which quickly encroached into the winter months - I didn’t have time to even prioritise what it is I want to focus on, write about, and work on in either personal and professional capacities.

As 2012 comes to a close, projects are winding down and I’ve had a chance to review and cross out goals and milestones I have achieved this year. With a bit more space on the "to-do" list, and more importantly, another year of life experience, I am ready to flesh out personal ambitions against professional demands, leisure activities, and other personal interests to come up with my 2013 priorities.

And so here I am, back on Tumblr with renewed energy and new ideas. Previously I’ve been blogging about how technology is changing people’ lives for the better, particularly when it comes to data and information management. I am absolutely fascinated by data so I’m taking one step further in this direction , I am getting my hands dirty, so to speak, with data.%0A

Perhaps I am lured by the sexiness of data scientists or the opulent lifestyle achievable by big data analysts, but I like to think that I'm just a bit of a late bloomer and all those geeky things I'm interested in are finally coming together into one, happy being. I've always loved logic grid puzzles, planned my holidays on Excel spreadsheets, and found satisfaction from writing HTML codes on a note pad. It's a shame that I didn't like my C++ teacher in high school nor gave statistics a proper chance in school, but I'm hoping it's not too late to start learning. Especially this time around I'm learning something for fun, rather than aiming for an A just to get as close to a 4.0 GPA as possible.

And so, this will be the journal of my adventure into data (big and small, open and private) using tools like statistics, R, Python, scripts, and more, sprinkled with occasional musings.

•18+ Adults Only

Watch Anya Live on Cam

Anya is live and ready to show you everything. Watch her strip, dance, and perform exclusive shows just for you. Interact in real-time and make your fantasies come true.

✓ Live Streaming✓ Interactive Chat✓ Private Shows✓ HD Quality✓ Free Actions

Free to watch • No registration required • HD streaming

Thank you for an amazing journey, so far!

(Image credit: GTez on Flickr)

It was just over a year ago that I started my Tumblelog after starting a new job and embarking on a new career path. For me, it was the perfect time to pick up blogging again, this time looking at technology that makes a difference in our everyday lives, from the gadgets that help us see, eat, exercise, and behave better, to the nuances of digital media that affect most of us without realising it.

One year on, I have updated my Tumblelog with 320 posts (including this one) and amassed 5,165 followers. I am taking a hiatus during the month of March while I recharge my batteries and narrow down the focus of my blogging (as well as professional and personal) life. I don’t know how many of my followers are ‘real’, how many of you will see this post, and how many of you (probably not too many!) will miss me for the next few days, but feel free to reply to this post and say hello to me on Twitter @cecilialiao.

See you in April!

Using your tongue to drive

Tongue Drive System, a dental retainer with sensors, have been developed by a team of researchers at Georgia Institute of Technology to help people with high-level spinal cord injuries to navigate their electric wheelchairs.

According to Wired UK:

The mouth device contains magnetic field sensors mounted on its four corners to detect the moment of the tiny magnet attached to the tongue. It is powered by a lithium-ion battery and has an induction coil to recharge it. The retainer sits against the roof of the mouth and is protected with a water-resistant material.

The output signals from the sensors are wirelessly transmitted to an iPod Touch or iPhone. Software installed on the Apple device works out the relative position of the magnet with respect to the array of sensors in real time and interprets the user's commands. This information is then used to control the movements of a cursor on the computer screen or to substitute for the joystick function in a powered wheelchair.

Who knows, maybe these retainers will become the next geeky accessory with a mainstream appeal. Imagine navigating a computer game or application with your tongue. Anthropologists could look back on the human race in 100 years and wonder why we all have such strong, large tongues...

#tech #technology #good idea #research

Microsoft's Illumishare 'allows remote people to share arbitrary, physical, or digital objects on any surface.' The video highlights how the technology could be used for remote tutoring as an example.

This technology enables teachers sat in a different location-be they in a different part of the same classroom or a different country altogether-to observe and instantly provide feedback to their students. Can you imagine how this technology could also enable the outsourcing the education of primary school children to tutors and teachers where cost of labour is lower...? Now that's a disruptive technology.

#tech #technology #education #good cause #good idea #disrption

DNA sequencing will personalise drugs to fight cancer

As reported by MIT's Technology Review:

Starting this spring, for about $5,000, any oncologist will be able to ship a sliver of tumor in a bar-coded package to Foundation's lab. Foundation will extract the DNA, sequence scores of cancer genes, and prepare a report to steer doctors and patients toward drugs, most still in early testing, that are known to target the cellular defects caused by the DNA errors the analysis turns up. Pellini says that about 70 percent of cases studied to date have yielded information that a doctor could act on—whether by prescribing a particular drug, stopping treatment with another, or enrolling the patient in a clinical trial.

#good cause #technology #tech

Wireless chips swim through your veins will revolutionalise medical technology.

Stanford electrical engineers have created a tiny wireless chip, driven by magnetic currents, that's small enough to travel inside the human body. They hope it will someday be used for a wide range of biomedical applications, from delivering drugs to cleaning arteries.

In a few years time, these little devices could be analysing your blood, delivering regular doses of drugs, and even zapping blood clots from your arteries.

#good cause #good idea #technology #tech #research

•18+ Adults Only

Watch Anya Live on Cam

Anya is live and ready to show you everything. Watch her strip, dance, and perform exclusive shows just for you. Interact in real-time and make your fantasies come true.

✓ Live Streaming✓ Interactive Chat✓ Private Shows✓ HD Quality✓ Free Actions

Free to watch • No registration required • HD streaming

BrailleTouch helps visually impaired users to type on an iPhone.

Researchers at Georgia Tech have developed a mobile app that lets users type without looking at their phone.

According to Futurity:

Early studies with visually impaired participants proficient in Braille typing have demonstrated that users can input at least six times the number of words per minute when compared to other research prototypes for eyes-free texting on a touch screen. Users reach up to 32 words per minute with 92 percent accuracy with the prototype app [of BrailleType] for the iPhone.

The developers are also looking at ways of converting BrailleTouch into a universal eyes-free texting app that replaces touch-only keyboards and other texting technologies.

One day, we might all find ourselves mindlessly drumming on our mobile phones we spell out our thoughts digitally.

#tech #technology #research #good idea #good cause #mobile app

Telemedicine - remotely medicating you in a timely fashion

What if you have a chronic condition that requires regular medication at regular intervals - would it be a big enough inconvenience for you from achieving a good quality of life? Would you fear going on holidays abroad in case you lose your medication, or would you fail to remember to take the prescriptions altogether?

Well, all those worries may soon become a thing of the past as MIT researchers have, after years of research, developed and successfully tested the first wirelessly controlled drug-delivery chip.

According to The Next Web

The programmable microchip contains capsules of medicine, and is implanted into a patient. Medicine is then released according to whatever regimen was prescribed. Professors Robert Langer and Michael Cima hope that the device will someday allow patients to receive consistent and reliable treatment without the painful task of daily injections.

What a great idea... now just to figure out the security mechanism so hackers can't maliciously take over the wireless signals and administer fatal doses from far, far away. (Sounds like a new genre of murder sci-fi thriller...)

#tech #technology #research #good cause #good idea #security

Trending Blogs

Last Seen Blogs

Cecilia's Musings