Coursecycle Blog @coursecycle - Tumblr Blog

SEO for Single Page Backbone Applications

When we were developing the front-end for Coursecycle, we decided to make it a single-page application backed by Backbone.js so that we could provide the most streamlined experience for our users. Unfortunately, one of the downsides to this approach is that our site cannot be crawled by search engine spiders, such as Googlebot, because they are unable to execute the JavaScript on web pages.

In order to address this issue, we decided to set up a prerender service that would generate static pages for bots. The following is the series of events that occurs:

nginx handles all incoming HTTP requests.

If the User-Agent of the HTTP request matches a list of known bots, then it is proxied to the prerender service, which uses PhantomJS as a headless browser to execute any JavaScript associated with the route. A static copy of the page, which contains all the injected content that a user might see in their web browser, is then returned to the bot.

Requests from regular users are proxied as usual to the Ruby application server. In order to deal with concurrent requests, we use puma, which shines in terms of concurrency.

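In our root location block, we pass every request through a try_files directive: if the URI matches a static file on disk, nginx serves it directly; otherwise the request falls through to the @prerender named location, which decides where to route it based on where it came from.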

location / {
    root /var/www/coursecycle/public;
    try_files $uri @prerender;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}

The following @prerender named location is mostly adapted from the official prerender.io nginx.conf file with a slight modification: we found that it was far easier to flag Googlebot through $http_user_agent than to rely on the _escaped_fragment_ argument.

location @prerender {
    set $prerender 0;
    if ($http_user_agent ~* "baiduspider|twitterbot|facebookexternalhit|rogerbot|linkedinbot|embedly|quora link preview|showyoubot|outbrain|pinterest|slackbot|Googlebot") {
        set $prerender 1;
    }
    if ($args ~ "_escaped_fragment_") {
        set $prerender 1;
    }
    if ($http_user_agent ~ "Prerender") {
        set $prerender 0;
    }
    if ($uri ~ "\.(js|css|xml|less|png|jpg|jpeg|gif|pdf|doc|txt|ico|rss|zip|mp3|rar|exe|wmv|doc|avi|ppt|mpg|mpeg|tif|wav|mov|psd|ai|xls|mp4|m4a|swf|dat|dmg|iso|flv|m4v|torrent)") {
        set $prerender 0;
    }

    resolver 8.8.8.8;

    if ($prerender = 1) {
        set $prerender "coursecycle.com:3100";
        rewrite .* /$scheme://$host$request_uri? break;
        proxy_pass http://$prerender;
    }
    if ($prerender = 0) {
        proxy_pass http://coursecycle;
        # rewrite .* /index.html break;
    }
}
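To sanity-check which requests end up at the prerender service, the four if checks in the @prerender block can be mirrored in a few lines of Ruby. This is only a sketch for reasoning about the rules, not something nginx runs; the method name prerender? and the constant names are our own, and the sketch assumes nginx applies the checks in the order they are written.

```ruby
# Case-insensitive, like nginx's ~* operator.
BOT_PATTERN = /baiduspider|twitterbot|facebookexternalhit|rogerbot|linkedinbot|embedly|quora link preview|showyoubot|outbrain|pinterest|slackbot|Googlebot/i

# Case-sensitive and unanchored, like nginx's ~ operator on $uri.
STATIC_PATTERN = /\.(js|css|xml|less|png|jpg|jpeg|gif|pdf|doc|txt|ico|rss|zip|mp3|rar|exe|wmv|avi|ppt|mpg|mpeg|tif|wav|mov|psd|ai|xls|mp4|m4a|swf|dat|dmg|iso|flv|m4v|torrent)/

# Returns true when the request would be proxied to the prerender service.
# Later checks override earlier ones, just as the repeated `set` calls do.
def prerender?(user_agent, query_string, path)
  decision = false
  decision = true  if user_agent.match?(BOT_PATTERN)
  decision = true  if query_string.include?('_escaped_fragment_')
  decision = false if user_agent.include?('Prerender') # avoid proxying the prerender service to itself
  decision = false if path.match?(STATIC_PATTERN)      # static assets never need prerendering
  decision
end
```

Because the static-file check runs last, a bot asking for a stylesheet or image is still served by the regular upstream.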

The last step was to generate a sitemap.xml file and submit it to Google Webmaster Tools so that they would know how to index the site. There are plenty of tutorials on how to do this online so we won't discuss it further, but we ended up writing a quick Ruby script to add all the URLs we wanted to be indexed into an XML file, then placing it at /sitemap.xml so Googlebot could find it.
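For illustration, a sitemap generator along these lines can be very small. The sketch below is not our actual script: the method name build_sitemap and the example URLs are hypothetical, and it emits only the required loc element of the sitemaps.org format.

```ruby
# Characters that must be escaped inside XML text content.
XML_ESCAPES = {
  '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;', "'" => '&apos;'
}.freeze

# Builds a minimal sitemap document from a list of absolute URLs.
def build_sitemap(urls)
  entries = urls.map do |url|
    "  <url><loc>#{url.gsub(/[&<>"']/, XML_ESCAPES)}</loc></url>"
  end
  [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    *entries,
    '</urlset>'
  ].join("\n") + "\n"
end

# Hypothetical URLs; a real script would pull these from the database.
sitemap = build_sitemap([
  'https://coursecycle.com/',
  'https://coursecycle.com/courses?dept=CS&num=161'
])
```

The resulting string can then be written to the public directory as sitemap.xml, for example with File.write.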

We weren't sure whether it was working at first, as none of our pages had been indexed, but a couple of days later we detected a spike in CPU activity.

We initially thought it was a runaway process of some sort, but it turned out that our phantomjs server was just working hard to serve up all the pages that the Googlebot was now indexing!

As of this posting, the Googlebot is still indexing a large number of our pages, but we're pretty happy with how this is turning out!

#engineering #coursecycle

Launch Post-Mortem: Much uniques, such excitement, hwow

We first made this public at around 6PM yesterday, and in about six hours, had logged over 1200 unique visits, thousands of page views, and tens of shares. We're pumped that you guys are just as excited about this as we are, and we're dedicated to bringing you the best experience for course rankings in both search and user interface/experience. We will be updating our blog as we release new features and patch bugs, so be sure to follow us to get the latest on how Coursecycle is doing.

Here's the first set of improvements that are now live!

You can now add and edit comments properly. Feel free to start contributing to the repository :)

Comments are now lazy-loaded with infinite scroll. The benefit to this is that page load times should be much better.

We're also announcing new features on Facebook now, so if you want to get notifications on your News Feed, like us on Facebook.

Thank you for your support everyone; we could not have done this without all of you :)

#launch #coursecycle

Part 2: Saving a piece of history by scraping Courserank

Gearing up

Although the go-to tool for scraping several years ago was the Python mechanize library, we found that it had not been updated since 2012, and that finding up-to-date resources for it would probably be an uphill battle. As a result, we opted to switch to Ruby and its mechanize library.

MongoDB was selected as a datastore for the data because it allowed us to quickly store JSON documents without thinking much about how the database was designed; because this is simply a data collection task, the database schema was not the first priority, though here is a sample of one of the objects in the collection:

{
  "title": "CHEMENG 185",
  "description": "Chemical Engineering Laboratory",
  "avg_grade": "A",
  "avg_rating": "4.0",
  "og_url": "https://www.courserank.com/stanford/course/2006/CHEMENG/185",
  "comments": [ ],
  "grades": {
    "official": [
      { "id": "0", "count": "0", "letter": "A+", "points": "4.3" },
      ...
    ],
    "unofficial": [
      { "id": "0", "count": "0", "letter": "A+", "points": "4.3" },
      ...
    ]
  }
}

Getting into the system

The first step was to simulate a login and retrieve the necessary cookies in our headless browser. This turned out to be a pretty simple affair, as we could store the Mechanize browser instance, cookies intact, and use it to programmatically make many requests. Spoofing the User-Agent header is a quick trick that makes servers think that the requests are coming from a legitimate browser and user, as opposed to an automated system.

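require 'mechanize'

# Logs in through the site's login form and returns the Mechanize agent,
# which keeps the session cookies for every later request.
def getMechanizeInstance(username, password)
  m = Mechanize.new
  # 'Mac Safari' is one of Mechanize's built-in aliases; setting the alias
  # sends a full Safari User-Agent string rather than the literal text.
  m.user_agent_alias = 'Mac Safari'
  m.get(LOGIN_FORM) do |page|
    page.form_with(:id => 'loginForm') do |f|
      f.username = username
      f.password = password
    end.click_button
  end
  m
end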

Retrieving the data

Once logged in, we had to figure out how to programmatically access each course page through some (ideally) simple heuristic. We realized that it was actually possible to access each page through the id parameter, so we just brute forced our way through this one.

https://www.courserank.com/stanford/course?id=40000 # EDUC 386

It turns out that there are three (approximately) valid ranges of ID numbers: (1) 0 to 60,000 (2) 195,000 to 200,000 (3) 205,000 to 222,500. This was hard-coded in the script so that we would not have to search through non-existent data (the range between 60,000 and 195,000 was especially brutal, as we were not sure if the script was still functional or not).
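Hard-coding the ranges might look something like the sketch below. The range bounds and the URL pattern come from above; the constant and method names are hypothetical, not taken from our scraper.

```ruby
# The three approximate ID ranges that actually contain courses.
VALID_ID_RANGES = [0..60_000, 195_000..200_000, 205_000..222_500].freeze
COURSE_URL = 'https://www.courserank.com/stanford/course?id=%d'.freeze

# Yields every candidate course URL, skipping the empty gaps between ranges.
# Returns an Enumerator when called without a block.
def each_candidate_url
  return enum_for(:each_candidate_url) unless block_given?

  VALID_ID_RANGES.each do |range|
    range.each { |id| yield format(COURSE_URL, id) }
  end
end
```

Restricting the loop to these ranges cuts the search space from roughly 225,000 IDs to 82,503.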

When the page is retrieved, the first thing that one may notice is that the entire div that is supposed to hold the comments is empty. Checking out the network logs with Chrome indicated that they were actually loaded in through an AJAX request as follows:

POST https://www.courserank.com/stanford/target/comment_sort
{
  "courseId": <course id>,
  "sort": <one of [dateD, dateA, ]>,
  "token": <token>
}

When we saw the token, our immediate thought was "shoot, they're familiar with this game". Our second thought was "since this is an AJAX request, it must be somewhere in my web browser". We poked around in the page source a little more, and found out that the token was conveniently loaded into the global scope with the following line of code at the beginning of the page:

Wow, how nice! We immediately went ahead and wrote a regex to get that token, and then directly made a request to the endpoint. Score!
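The extraction itself is a one-line regex match. The sketch below is purely illustrative: the sample markup, the variable name token, and the value are made up for the example and do not reproduce the site's actual page source or our actual pattern.

```ruby
# Made-up page fragment for illustration only; the real page's assignment
# may have used a different variable name and quoting style.
SAMPLE_HTML = <<~HTML
  <script type="text/javascript">
    var token = "abc123";
  </script>
HTML

# Captures the quoted value assigned to a (hypothetical) global `token`.
TOKEN_PATTERN = /var\s+token\s*=\s*["']([^"']+)["']/

# Returns the token string, or nil when the page does not contain one.
def extract_token(html)
  match = TOKEN_PATTERN.match(html)
  match && match[1]
end
```

Returning nil on a miss lets the caller tell a page without a token apart from an empty token.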

The last little bit of data that we were missing was the grade data that was used to generate the charts at the top of the page. We initially thought that this would be difficult if not impossible to accomplish, as the charts were Flash objects, but then we realized that they too were loading the data via an AJAX request. More network sniffing indicated that the endpoint for grade data was as follows:

POST https://www.courserank.com/stanford/target/course_grades { "course": <courseid> }

and a short POST request later, a nice XML response was sitting in the Terminal window.

The last challenge was that with 39,371 distinct objects in the system spread across IDs ranging from approximately 0 to 225,000, brute forcing was taking a really long time. Fortunately, making a Ruby program threaded is a fairly simple affair, and we were able to take full advantage of the time in between network requests to process the XML data that we were retrieving.
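A minimal sketch of this kind of threading, assuming a fixed pool of workers pulling IDs off a shared queue, is shown below. The method name process_in_threads and the thread count are hypothetical, and the block passed in stands in for the real fetch-and-parse step. Threads help here because Ruby releases its global lock while a thread waits on network I/O.

```ruby
require 'thread' # Queue is built in on modern Rubies; this require is harmless

# Runs the given block once per ID on a small pool of worker threads and
# returns a hash of id => result.
def process_in_threads(ids, thread_count: 8, &work)
  pending = Queue.new
  ids.each { |id| pending << id }
  finished = Queue.new

  workers = Array.new(thread_count) do
    Thread.new do
      loop do
        id = begin
          pending.pop(true) # non-blocking pop; raises ThreadError when empty
        rescue ThreadError
          break             # no work left, so this worker exits
        end
        finished << [id, work.call(id)]
      end
    end
  end
  workers.each(&:join)

  Array.new(finished.size) { finished.pop }.to_h
end

# Stand-in workload; a real scraper would fetch and parse a course page here.
results = process_in_threads([1, 2, 3], thread_count: 2) { |id| id * 2 }
```

Filling the queue before starting the workers keeps the shutdown logic simple: an empty queue always means the work is done.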

Here are the final statistics associated with the dataset that was retrieved:

> db.courserank.stats()
{
  "ns" : "courseriver.courserank",
  "count" : 39371,
  "size" : 172544848,
  "avgObjSize" : 4382,
  "storageSize" : 230629376,
  "numExtents" : 14,
  "nindexes" : 1,
  "lastExtentSize" : 62554112,
  "paddingFactor" : 1,
  "systemFlags" : 1,
  "userFlags" : 1,
  "totalIndexSize" : 1283632,
  "indexSizes" : { "_id_" : 1283632 },
  "ok" : 1
}

Final Thoughts

We still have no idea how the account we used was not banned from the system; over the course of several hours, a single account and IP address were used to make several gigabytes' worth of network requests in a highly programmatic manner. Any competent network sysadmin would have immediately noticed that there was something fishy going on. Our only conclusion is that the site is (sadly) no longer considered valuable by Chegg, such that there is literally just a skeleton crew running the whole thing.

All code associated with the scraper is open sourced under the MIT license and located at coursecycle/courserank; if you find it useful for writing a scraper of your own, let us know!

#engineering

Part 1: Seeding the database with Explorecourses

Here at Coursecycle, we're in the process of building a community that allows you to review and rate the classes that you've taken, and share this information with your fellow students to better inform course selection. An important part of building a site like this is keeping data up to date with the university registrar so we always have the newest class listings available.

The instinctive approach is to scrape the data directly from the Explorecourses site itself. Although this worked reasonably well for our needs, we found that there was data on instructors that was not being exposed through the main Explorecourses site, and that this information was essential to creating a well-populated site.

Fortunately, there is another approach that is arguably much better, as it operates on structured data. It just involves talking to the guy who administers the site, Jim Sproch. I talked to him a while ago over email, and found that Explorecourses exposes an API of sorts through the following endpoint:

http://explorecourses.stanford.edu/search

It accepts GET requests with the parameters "view", which represents the catalog snapshot that you are using, and "q", which takes a query for search. A sample request is as follows:

http://explorecourses.stanford.edu/search?view=xml-20140630&q=CS161

The XML is always returned with a tag "latestVersion" that indicates the latest catalog snapshot available, so you can always periodically check that to ensure that you are receiving the newest available catalog.

In order to get the entire XML for the catalog, just feed the wildcard operator into the query. Do not do this in the browser, as it will crash your tab since the entire data set is a little under 300MB in size. Instead, do a simple wget:

wget "http://explorecourses.stanford.edu/search?view=xml-20140630&q=%"
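If you would rather build these request URLs in Ruby than by hand, the standard library's URI module can handle the query string. This is just a sketch: the helper name search_url is hypothetical, and it percent-encodes the wildcard as %25, which a server should decode back to the same % character.

```ruby
require 'uri'

EXPLORECOURSES_SEARCH = 'http://explorecourses.stanford.edu/search'.freeze

# Builds a search URL from a catalog snapshot name ("view") and a query ("q").
def search_url(view, query)
  "#{EXPLORECOURSES_SEARCH}?#{URI.encode_www_form(view: view, q: query)}"
end

single_course = search_url('xml-20140630', 'CS161')
whole_catalog = search_url('xml-20140630', '%')
```

Encoding the parameters this way also keeps queries containing spaces or ampersands from breaking the URL.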

After a couple of minutes, you'll have the entire Stanford course catalog sitting on your computer. You can use a library like nokogiri to parse the XML into a format that you might be more comfortable with, like JSON :)

In the next part, we'll be discussing how we saved the contents of the entire Courserank database in light of its impending shutdown. Follow us on Tumblr to get the latest updates!

#engineering #explorecourses
