Data Scraping (progress..)
Finding the minimum ID (the lazy way)
To find the minimum recipe ID on allrecipes.com, and so ensure a quick download of the data later, I ran the code from ID 1, writing to file once it found a valid ID. Ideally I would like to grab the maximum ID as well, but with gaps between IDs this isn't really possible, so I may just incorporate a limit on the number of recipes downloaded instead. This prolonged use of the code flagged a silly mistake I had made - after running for a long time the connection to the site occasionally times out, so I added a failsafe that tries for 10 seconds, then gives up and moves on to another ID rather than crashing.
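The failsafe described above can be sketched roughly like this. It is only a minimal illustration, not the actual script: the URL pattern, the function names `fetch_status` and `find_first_valid_id`, and the idea that a valid ID answers with HTTP 200 are all assumptions made for the example.

```python
import urllib.error
import urllib.request

# Assumed URL pattern, for illustration only; the post does not state it.
URL_TEMPLATE = "https://www.allrecipes.com/recipe/{recipe_id}/"


def fetch_status(recipe_id, timeout=10):
    """Return the HTTP status for one ID, or None if the request fails or times out."""
    url = URL_TEMPLATE.format(recipe_id=recipe_id)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, just not with a page (e.g. 404 for an unused ID).
        return err.code
    except OSError:
        # URLError, timeouts and dropped connections are all OSError subclasses:
        # give up on this ID instead of crashing the whole run.
        return None


def find_first_valid_id(start=1, stop=1000, fetch=fetch_status):
    """Scan IDs in order and return the first one that answers with 200."""
    for recipe_id in range(start, stop):
        if fetch(recipe_id) == 200:
            return recipe_id
    return None
```

Passing the fetch function in as a parameter means the scanning loop can be tried out with a fake lookup before pointing it at the real site.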
Here is the moment it found a valid ID:
My struggles with the csv writer
I looked for a few copy-and-paste tutorials with some csv-writing code that I could look over and check worked. I usually do this to grasp how the code works, then completely change it and keep testing as I go. This way I can figure out where I'm going wrong and refer back to a working copy when it all goes horribly, horribly wrong.
Unfortunately, when I copied and pasted code that other people have used - using the 'csv' library and the 'csv.writer' function - it just did not work.
I then figured that maybe the csv import wasn't working and went on to download Anaconda - this package has SciPy, NumPy and Pandas (libraries I was eventually going to download anyway for the remainder of my project). It also handily has csv as one of its libraries and a nifty new command prompt. But alas, I was getting the same error:
AttributeError: 'module' object has no attribute 'writer'...
I would understand this if I weren't following tutorials, Stack Overflow suggestions and the Python docs themselves, but I was sure that this was how you called this function.
csvout = csv.writer(csvfile, delimiter=' '...)
csv.writer doesn't exist according to the Command Prompt, the Anaconda command prompt and every other method I attempted. I then wondered whether it was my version of Python - I've been caught out by that before. So I decided to import the library and have a look at what built-in functions there are:
Writer is there... So what's the error!?
Turns out a terribly named file really can cause all sorts of issues, and I shall never name a file csv.py again as long as I shall live: Python was importing my own csv.py instead of the real csv library. Two hours I won't get back, but I will always use a proper naming convention from now on. It's sorted, and now I can progress :)
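For anyone hitting the same error, a quick way to check for this kind of name clash is to look at where the module is being loaded from. The snippet below is a small sketch; `find_shadowing_files` is a made-up helper name for this example, not something from the csv library.

```python
import csv
import os

# Where is `csv` actually being loaded from? A path inside your own project
# folder (rather than the Python installation) means a local file is shadowing it.
print(csv.__file__)


def find_shadowing_files(module_name, search_dirs):
    """Return paths of local .py files that share a name with `module_name`."""
    hits = []
    for directory in search_dirs:
        candidate = os.path.join(directory, module_name + ".py")
        if os.path.isfile(candidate):
            hits.append(candidate)
    return hits


# Check the current working directory, where a script would normally be run from.
print(find_shadowing_files("csv", [os.getcwd()]))
```

If the second print shows a csv.py of your own, renaming it (and deleting any leftover csv.pyc next to it) should bring the real library back.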
Changing the way it saves to a csv file
Having it save to a text file is great - you can view it exactly how you want to - but it really isn't a good format for actually analysing the data. Currently I'm torn between turning it into a csv file or putting it into a database. A csv file may be easier right now, but in the long run a database is something I have more experience with and is generally a good way to analyse data.
Creating a csv file is the route I will go down for now, however, as turning a csv file into a structured dataset in a database isn't difficult, so I'm really killing two birds with one stone.
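To give an idea of how small the csv-to-database step can be, here is a rough sketch using Python's built-in sqlite3 module. The choice of SQLite, the table name, the column names and the sample rows are all assumptions for illustration; none of it is real scraped data.

```python
import csv
import io
import sqlite3

# Made-up sample rows; not real scraped data.
SAMPLE_CSV = """id,name,num_ingredients,stars,num_ratings
101,Example Soup,5,4.5,120
102,Example Cake,8,3.9,45
"""


def load_csv_into_db(csv_file, conn):
    """Create a `recipes` table if needed and insert every row of the csv file."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS recipes ("
        "id INTEGER PRIMARY KEY, name TEXT, num_ingredients INTEGER, "
        "stars REAL, num_ratings INTEGER)"
    )
    rows = [
        (
            int(row["id"]),
            row["name"],
            int(row["num_ingredients"]),
            float(row["stars"]),
            int(row["num_ratings"]),
        )
        for row in csv.DictReader(csv_file)
    ]
    conn.executemany("INSERT INTO recipes VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
loaded = load_csv_into_db(io.StringIO(SAMPLE_CSV), conn)
```

With a real file, `io.StringIO(SAMPLE_CSV)` would be swapped for `open("recipes.csv", newline="")` (file name hypothetical).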
How to store the data within the csv file
This really isn't a simple thing: with a list of ingredients per recipe, what is the best layout? I'm going to take a sample of 100 recipes to test out a few different methods.
<ID>, <Recipe Name>, <Number of Ingredients>, <List of Ingredients Separated>, <Stars>, <Number of ratings>
<ID>, <Recipe Name>, <Number of Ingredients>, <Stars>, <Number of ratings>
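As a rough sketch of what the two layouts above might look like when written with csv.writer: the recipe below is invented, the helper names are made up for this example, and packing the ingredient list into a single cell joined by "; " is just one possible reading of "separated".

```python
import csv
import io

# Hypothetical recipe record, invented for the example.
recipe = {
    "id": 101,
    "name": "Example Soup",
    "ingredients": ["water", "salt", "carrot"],
    "stars": 4.5,
    "ratings": 120,
}


def full_row(r, sep="; "):
    """First layout: the ingredient list is packed into one cell."""
    return [r["id"], r["name"], len(r["ingredients"]),
            sep.join(r["ingredients"]), r["stars"], r["ratings"]]


def summary_row(r):
    """Second layout: only the ingredient count is kept."""
    return [r["id"], r["name"], len(r["ingredients"]), r["stars"], r["ratings"]]


def to_csv(rows):
    """Write rows to an in-memory buffer and return the csv text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


print(to_csv([full_row(recipe)]))
print(to_csv([summary_row(recipe)]))
```

One nice side effect of going through csv.writer rather than joining strings by hand is that a recipe name containing a comma gets quoted automatically, so the columns stay lined up.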