Migrating from Tumblr and Wordpress to Docpad - Extract and Transform
How to extract #tumblr and #wordpress posts, and add them to #docpad using #nodejs

Migrating from Tumblr and Wordpress to Docpad - Static Site Generation
Going from #tumblr and #wordpress to #docpad - the case for static site generation
Migrating from Tumblr and Wordpress to Docpad - Part 2
As promised in the previous post, let us take a look at how to extract data from a wordpress blog, and transform it for docpad.
Get Your Node On
    mkdir blog-extract
    cd blog-extract
    npm init # accept all the defaults, it is not very important
    npm install --save request mkdirp moment tumblr.js
    touch index.js
Edit index.js, and add the following:
    var fs = require('fs');
    var url = require('url');
    var path = require('path');
    var mkdirp = require('mkdirp');
    var moment = require('moment');
    var request = require('request');
    var tumblr = require('tumblr.js');
Now we have a shiny new NodeJs project ready to go, with batteries (dependencies) included.
Wordpress Posts API
Wordpress exposes a JSON API that allows you to extract your posts. There is almost no set up required, as no form of authentication is required.
In order to get our posts, we can follow these instructions.
Extract and transform
With the API documentation in hand, we can now write some code to automate that - we certainly do not want to be issuing multiple wget or curl calls, and then copying the results from them into new files. I would do that for maybe a couple of posts, but I am dealing with about 80 posts here, and that is certainly going to be too time consuming of an endeavour!
    var pos, step, total;
    var wordpressSite = 'yourblogname.wordpress.com'; // replace with your own
    pos = 0;
    step = 20;
    total = 0;
    do {
      /*
       * Here we do the queries, and be sure to set total so that it loops more than once
       * The looping is necessary, because you cannot download all posts at once
       * and we must paginate the requests
       */
    } while (pos < total);
That is the basic run loop. Within the run loop, we perform the actual query:
    // step is the page size and pos the offset, as declared in the run loop
    var reqUrl = 'https://public-api.wordpress.com/rest/v1/sites/'+wordpressSite+
      '/posts/?number='+step+'&offset='+pos;
    request(reqUrl, function(err, resp, body) {
      if (err || resp.statusCode !== 200) {
        console.log(err);
        return;
      }
      body = JSON.parse(body);
      if (body.total_posts > total) {
        //set total count, should only happen the first time
        total = body.total_posts;
      }
      //parse each of the posts in the response
      body.posts.forEach(function(post) {
        //transform the post into the format required by docpad
        //and write to file
      });
    });
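One thing to watch out for: request is asynchronous, so a do/while loop tests pos < total before any response has arrived. A minimal sketch of one way around this is to start each page request from the callback of the previous one. The names fetchAllPosts, fetchPage and fakeFetchPage are made up for illustration, and the fake fetcher stands in for the real API call so that the sketch runs without a network.

```javascript
// Sketch: fetch pages one after another, starting the next request only
// once the previous response has arrived. `fetchPage` is a hypothetical
// function(offset, limit, callback) that calls back with
// (err, { total_posts: Number, posts: Array }).
function fetchAllPosts(fetchPage, step, onPost, done) {
  function next(pos) {
    fetchPage(pos, step, function (err, body) {
      if (err) { return done(err); }
      body.posts.forEach(onPost);
      var nextPos = pos + step;
      // Stop once we have walked past the reported total,
      // or when a page comes back empty.
      if (body.posts.length === 0 || nextPos >= body.total_posts) {
        return done(null);
      }
      next(nextPos);
    });
  }
  next(0);
}

// Stand-in for the real API call (an assumption, not the Wordpress API).
var fakePosts = [];
for (var i = 0; i < 45; i++) { fakePosts.push({ ID: i }); }
function fakeFetchPage(offset, limit, callback) {
  callback(null, {
    total_posts: fakePosts.length,
    posts: fakePosts.slice(offset, offset + limit)
  });
}

var seen = [];
var finished = false;
fetchAllPosts(fakeFetchPage, 20,
  function (post) { seen.push(post.ID); },
  function (err) { finished = !err; });
console.log('Fetched', seen.length, 'posts');
```

To use it against the real API, fetchPage would wrap the request call shown above and pass the parsed body to the callback.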
We can take a look at what the API response for each blog post looks like in these instructions.
The format that we need to translate to consists of two important parts:
Directory and file name
Metadata
The third part is the post's contents, but that can be copied verbatim without any transformation.
For a default docpad blog configuration, this would usually be: src/documents/posts/slug-for-this-post.html
We can check this by looking at docpad.coffee, and inspecting docpadConfig.collections.posts:
`@getCollection('documents').findAllLive({relativeDirPath: 'posts'}, [date: -1])`
We are, however, not going to put our extracted files in the posts folder. Instead we will create a separate wordpressposts folder for all the wordpress posts, and configure docpad to look there as well. This configuration is covered at the end, so if you want to test things out right away, skip to the bottom of the post.
I am using the plugin docpad-plugin-dateurls, so that the URL path of each post will match the default wordpress URL path. Here, we want the directory and file name to follow this pattern: src/documents/wordpressposts/YYYY-MM-DD-slug-for-post.html
    var postUrl = url.parse(post.URL);
    var pathname = postUrl.pathname;
    // strip the trailing slash, if there is one
    if (pathname.charAt(pathname.length - 1) === '/') {
      pathname = pathname.slice(0, -1);
    }
    // strip the leading slash, and turn the remaining slashes into dashes
    pathname = pathname.slice(1).replace( /\//g , '-');
    var filename = path.normalize('src/documents/wordpressposts/'+pathname+'.html');
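To see what this transformation produces, here is a small self-contained sketch of the same idea. The helper name filenameFromPostUrl and the example URL are made up for illustration, and it uses the built-in URL class instead of url.parse, which is an assumption about your Node version (10 or later).

```javascript
// Sketch: derive the docpad file name from a post's public URL.
function filenameFromPostUrl(postUrl, baseDir) {
  var pathname = new URL(postUrl).pathname; // e.g. /2014/01/15/slug-for-post/
  // Drop leading and trailing slashes, then turn the rest into dashes.
  var flat = pathname.replace(/^\/+|\/+$/g, '').replace(/\//g, '-');
  return baseDir + '/' + flat + '.html';
}

var exampleFilename = filenameFromPostUrl(
  'https://yourblogname.wordpress.com/2014/01/15/slug-for-post/',
  'src/documents/wordpressposts'
);
console.log(exampleFilename);
// prints "src/documents/wordpressposts/2014-01-15-slug-for-post.html"
```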
For the metadata, we use moment to format the date and time
    // escape double quotes, because the title is written as a quoted YAML string
    var title = (post.title || '').replace(/"/g, '\\"');
    // HH is the 24-hour clock; hh would lose the AM/PM distinction
    var date = moment(post.date).format('YYYY-MM-DD HH:mm');
    var tags = Object.keys(post.tags).join(', ');
    var contents = '---\n'+
      'layout: post\n'+
      'comments: true\n'+
      'title: "'+title+'"\n'+
      'date: '+date+'\n'+
      'original-url: '+post.URL+'\n'+
      'dateurls-override: '+postUrl.pathname+'\n'+
      'tags: '+tags+'\n'+
      '---\n\n'+post.content;
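A title that contains double quotes, or a colon followed by a space, can trip up the YAML parser that reads this metadata. A minimal sketch of one way to guard against that is to always emit the title as a double-quoted YAML string. The helper name yamlQuote and the example title are made up for illustration.

```javascript
// Sketch: emit a string as a double-quoted YAML scalar,
// escaping backslashes first and then double quotes.
function yamlQuote(text) {
  return '"' + String(text).replace(/\\/g, '\\\\').replace(/"/g, '\\"') + '"';
}

var exampleTitle = 'Why "static" sites: a case study';
var frontMatterLine = 'title: ' + yamlQuote(exampleTitle);
console.log(frontMatterLine);
// prints: title: "Why \"static\" sites: a case study"
```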
Finally, write the output to file:
    var dirname = path.dirname(filename);
    // create the directory synchronously, so that it exists before the write starts
    mkdirp.sync(dirname);
    fs.writeFile(path.normalize(filename), contents, function(err) {
      if (err) {
        console.log('Error', filename, err);
        return;
      }
      console.log('Written', filename);
    });
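If you are on a newer version of Node (10.12 or later, which is an assumption about your setup), the built-in fs module can create nested directories on its own. A minimal sketch, with a made-up helper name writePost and a temporary directory standing in for the real site folder:

```javascript
var fs = require('fs');
var os = require('os');
var path = require('path');

// Sketch: create the target directory (and any missing parents)
// synchronously, then write the file.
function writePost(filename, contents) {
  fs.mkdirSync(path.dirname(filename), { recursive: true });
  fs.writeFileSync(filename, contents);
}

// Example run against a temporary directory rather than the real site.
var tmpRoot = fs.mkdtempSync(path.join(os.tmpdir(), 'blog-extract-'));
var target = path.join(tmpRoot, 'src', 'documents', 'wordpressposts', 'example.html');
writePost(target, '---\ntitle: "Example"\n---\n\nHello');
console.log('Written', target);
```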
Tumblr Posts API
Tumblr is a little more involved than Wordpress: in order to query any of their APIs, you will need to have a tumblr account (which you probably already have, since you are extracting your posts from it), and register a tumblr app to obtain an API key. Copy your "OAuth Consumer Key", and you are good to go.
Once that is done, we simply need to follow this section in the documentation. The upside of this slightly higher complexity is that tumblr provides a NodeJs client library that makes it easier to call the tumblr API, and avoids having to deal with making raw HTTP requests, like we did for the Wordpress API.
Extract and transform
    var tumblrSite = 'bguiz.tumblr.com'; // replace with your own
    var client = tumblr.createClient({
      consumer_key: 'sfsdfsdfsdfjkjksjdfhkjkjhkjshdfkjhkjhskdjfhkjhkjhd' //replace with your own
    });
    pos = 0;
    step = 20;
    total = 0;
    do {
      /*
       * Perform the paginated requests
       */
    } while (pos < total);
Performing the requests:
    client.posts(tumblrSite, {
      offset: pos,
      limit: step,
    }, function(err, data) {
      if (err || ! data) {
        console.log(err, data);
        return;
      }
      if (data.total_posts > total) {
        //set total count, should only happen the first time
        total = data.total_posts;
      }
      data.posts.forEach(function(post) {
        //transform the post into the format required by docpad
        //and write to file
      });
    });
Here, we want the directory and file name to follow this pattern: src/documents/tumblrposts/YYYY-MM-DD-slug-for-post.html
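    // tumblr timestamps are in seconds, moment expects milliseconds
    var ts = moment(post.timestamp*1000);
    var postUrl = url.parse(post.post_url);
    // HH is the 24-hour clock; hh would lose the AM/PM distinction
    var dateStr = ts.format('YYYY-MM-DD HH:mm');
    var filename = 'src/documents/tumblrposts/'+ts.format('YYYY-MM-DD')+
      '-'+postUrl.pathname.split('/').slice(-1)[0]+'.html';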
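Here is a small self-contained sketch of the same file name derivation, without moment. The helper name tumblrFilename and the example post are made up for illustration. Note that it takes the date in UTC, whereas moment formats in the local time zone, so the two can disagree for posts written close to midnight.

```javascript
// Sketch: Tumblr reports `timestamp` in seconds since the Unix epoch,
// while JavaScript dates count milliseconds, hence the * 1000.
function tumblrFilename(post) {
  var day = new Date(post.timestamp * 1000).toISOString().slice(0, 10); // UTC date
  // The slug is the last segment of the post's URL path.
  var slug = new URL(post.post_url).pathname.split('/').slice(-1)[0];
  return 'src/documents/tumblrposts/' + day + '-' + slug + '.html';
}

var examplePost = {
  timestamp: 1389744000, // 2014-01-15 00:00:00 UTC
  post_url: 'http://example.tumblr.com/post/12345678/slug-for-this-post'
};
console.log(tumblrFilename(examplePost));
// prints "src/documents/tumblrposts/2014-01-15-slug-for-this-post.html"
```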
For the metadata, we want to set the dateurls-override property. Note that this feature is not yet available in docpad-plugin-dateurls, and you will need my patch for this to work. To get this, modify package.json in your root folder, replacing the version number of the plugin with an explicit git URI, like so:
    "docpad-plugin-dateurls": "git+ssh://git@github.com:bguiz/docpad-plugin-dateurls.git#exclude-option",
This tells npm to install a NodeJs package, not from the default npm repository, but instead by cloning a git repository. Unfortunately, this also means docpad will not be able to run the plugin yet, as npm installing a git url does not run prepublish. To work around this, for now, you need to do the following:
    npm install
    docpad run # fails "Error: Cannot find module 'node_modules/docpad-plugin-dateurls/out/dateurls.plugin.js'"
    cd node_modules/docpad-plugin-dateurls
    cake compile
    ls out # you should see dateurls.plugin.js
    cd ../..
    docpad run # success!
For tumblr posts, the default URL path follows the format /post/12345678/slug-for-this-post, and if we migrate posts from the old blog to the new blog, any links to the site, especially external ones, will be broken. That will make for a really annoying experience for those visiting your site, so it is best to preserve URLs where possible; hence the need to override the default URLs.
    // escape double quotes, because the title is written as a quoted YAML string
    var title = (post.title || '').replace(/"/g, '\\"');
    var tags = post.tags.join(', ');
    var contents = '---\n'+
      'layout: post\n'+
      'comments: true\n'+
      'title: "'+title+'"\n'+
      'date: '+dateStr+'\n'+
      'original-url: '+post.post_url+'\n'+
      'dateurls-override: '+postUrl.pathname+'\n'+
      'tags: '+tags+'\n'+
      '---\n\n'+post.body;
Finally, write the output to file:
    var dirname = path.dirname(filename);
    // create the directory synchronously, so that it exists before the write starts
    mkdirp.sync(dirname);
    fs.writeFile(path.normalize(filename), contents, function(err) {
      if (err) {
        console.log('Error', filename, err);
        return;
      }
      console.log('Written', filename);
    });
Docpad Configuration Changes
We edit docpad.coffee, in the root directory of the docpad project. Modify docpadConfig.collections.posts to look like this instead.
@getCollection('documents').findAllLive({relativeDirPath: {'$in' : ['docpadposts', 'tumblrposts', 'wordpressposts']}}, [date: -1])
All the wordpress posts should be in src/documents/wordpressposts, and the tumblr posts in src/documents/tumblrposts. When writing any new docpad posts, save them in src/documents/docpadposts.
If you have any docpadConfig.environments configured, be sure to modify each of their collections.posts accordingly too.
That is all there is to do for now. Execute docpad run, and visit the newly extracted blog in a browser!
Where to from here?
One task in blog extraction that we have not covered here is that of static assets, most notably images, that may have been hosted on your previous blogs. If you have hosted these on CDNs, they will continue to work. Otherwise, you will need to extract them too.
Another extraction task that we have not covered is links between posts. Since we have preserved the path for each post's URL here, this should not pose a problem.
The solution to both of these involves parsing the URLs in each post's content, be it href attributes in <a> tags, or src attributes in <img> tags, and downloading and saving the referenced files too.
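As a starting point, here is a minimal sketch of the parsing step only. The helper name extractUrls and the sample markup are made up for illustration, and the regular expression is a rough approximation: it will miss unusual markup, so a real HTML parser would be the more robust choice.

```javascript
// Sketch: collect the URLs referenced by href attributes in <a> tags
// and src attributes in <img> tags.
function extractUrls(html) {
  var pattern = /<(a|img)\b[^>]*?\b(href|src)\s*=\s*["']([^"']+)["']/gi;
  var found = { links: [], images: [] };
  var match;
  while ((match = pattern.exec(html)) !== null) {
    var tag = match[1].toLowerCase();
    var attr = match[2].toLowerCase();
    if (tag === 'a' && attr === 'href') { found.links.push(match[3]); }
    if (tag === 'img' && attr === 'src') { found.images.push(match[3]); }
  }
  return found;
}

var sampleHtml =
  '<p>See <a href="/post/12345678/slug-for-this-post">this post</a>.</p>' +
  '<img alt="chart" src="http://example.com/images/chart.png">';
var urls = extractUrls(sampleHtml);
console.log(urls);
```

Each post's content could be passed through this before it is written to file, with the images queued for download and the links checked against the preserved paths.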
Migrating from Tumblr and Wordpress to Docpad - Static Site Generation
I currently write my blog using tumblr, and previously I blogged using wordpress. While both of these are great platforms, they share common pitfalls, when it comes to giving you control over your writing.
I wanted to be able to have a copy of all the assets that comprise my blog, in its entirety, on my hard disk, and be able to modify and publish them as I pleased. I also wanted to be able to include fancier things in my pages - like embed a Github gist, or create my own d3 visualisation, or, well why not take it to an extreme, create an AngularJs app running within one of my posts; and I wanted to be able to do all of these things without having to log into some website hosted in a far away country, and wait for all those bytes to fly across several oceans and back each time.
Flexibility and control - that is key.
Enter Static Site Generators
For a blog, the contents are almost static. The server only needs to send a different response for a page, when that page has been modified by the author. The exception to this are comments, but with the advent of disqus, that is no longer even a consideration.
A content management system, including both tumblr and wordpress, builds each page upon demand, which can be an expensive operation, as it involves database queries, assembly of templates, et cetera. Quite often, when a CMS driven site receives a lot of concurrent visitors, its response times start to lag noticeably. To work around this, it has become common practice to cache the results of each dynamically generated page, using tools like memcached.
Static site generation is all about taking caching to the next level. The author of the site knows exactly when the previous cache needs to be invalidated - when they write a new post or update an existing one. Why not, at that point in time, generate the cache contents, and upload them directly to the server? Well, that is exactly what static site generators do; the static files are the cache.
What about collaboration?
One of the big advantages of a CMS is that it enables collaboration. If everyone just logs into the same website, be it wordpress.org or tumblr.com, and makes their edits on the site, then there is only one copy of the site, and therefore it is easy to manage collaboration on the contents of the site.
Indeed that is a very direct and simple solution that addresses collaboration. We do, however, have a more sophisticated solution, that is already readily available: distributed version control systems. Tools such as git and mercurial have solved the distributed collaboration problem in a rather elegant way. All collaborators get to keep a copy of the site that they are contributing to on their own computers, and thus get the benefits that come along with that. When they are done writing a post, they simply have to push their latest contributions to the master copy. There are built in mechanisms to resolve any conflicts, for example, if two collaborators edit the same file.
Docpad
After reviewing the top few in this humungous list, I have decided that Docpad suits my needs the best, and I should be able to hit the ground running. I will give it a go, and the best part is, if I do not like it, my data is not stuck on some server somewhere - it will all be on my computer, and easily moved to a different static site generator.
In the next post, I will be tackling that very problem: With hosted CMSs, like tumblr and wordpress, getting your data out can be a little tricky; as can be transforming it such that it can be used in a static site generator.
Docpad & Dropbox
Docpad: Publishing Articles with Dropbox
Foreword
My recent interest in Node.js made me want to run my blog, or rather my landing page, with a static site generator built on Node. While searching for a suitable system, I came across Docpad fairly quickly. Docpad describes itself as follows:
DocPad is a next generation web framework; allowing for content management via the file system, rendering via plugins, and static site generation for deployment anywhere. It's built with Node and Express.js, making it naturally fast and easily extendable.
For me, the most important features were the Markdown support and the integrated scaffolding with Bootstrap and Jade. Since I knew it from my old blog, which was based on Second Crack, I also wanted to be able to publish blog posts via Dropbox. As I could not find a decent guide online on how to do this with Docpad, I set about it myself.
Since my server runs Ubuntu, all of the following steps refer to that distribution.
1. Setting up Docpad
First we install Node.js, specifically the current version rather than the outdated one from the repositories:
    wget http://nodejs.org/dist/v0.8.22/node-v0.8.22.tar.gz
    tar -xzf node-v0.8.22.tar.gz
    cd node-v0.8.22
    ./configure
    make
    sudo make install
Then we make sure that npm is up to date as well, and use it to install Docpad.
    npm install -g npm
    npm install -g [email protected]
2. Installing Dropbox
Now for Dropbox. I created a separate account for this, because I did not want to mirror my entire Dropbox onto the server.
To install the Linux client, you only need to follow the instructions from Dropbox.
The CLI script and autostart are then set up like this:
    # save the script as dropbox.py, so that the chmod below finds it
    cd ~ && wget -O dropbox.py "https://www.dropbox.com/download?dl=packages/dropbox.py"
    chmod +x dropbox.py
    ./dropbox.py autostart y
    ./dropbox.py start
After setting this up, I created a folder named "Blog" in the Dropbox and shared it with my main account. This way only the Blog folder lives on the server, and I can still access it from my devices through the main account.
3. Creating the blog
On the server, we change into the Blog folder we just created, add a few subfolders, and create a new Docpad site, roughly like this:
    cd ~/Dropbox/Blog/
    mkdir drafts
    mkdir scripts
    mkdir site
    cd site
    docpad run
Once you have answered all the questions from the last step, the skeleton of your new site should be in ~/Dropbox/Blog/site. Before going any further, we check that everything worked. If running
    docpad generate --env static
leaves usable files in the folder ~/Dropbox/Blog/site/out, we can move on.
4. Marrying Dropbox and Docpad
Now for the actual goal: a site that updates itself as soon as anything changes in the underlying Markdown files. To make that work, we first create a suitable shell script
    cd ~/Dropbox/Blog/scripts
    nano publish.sh
and fill it with the following contents (which of course need to be adapted to your exact configuration)
    #!/bin/bash
    cd /home/<user>/Dropbox/Blog/site
    rm /home/<user>/Dropbox/Blog/site/src/documents/posts/*
    cp /home/<user>/Dropbox/Blog/publish/*.md /home/<user>/Dropbox/Blog/site/src/documents/posts/
    chmod 0755 /home/<user>/Dropbox/Blog/site/out
    echo generating site
    docpad generate --env static
    echo setting rights
    cd /home/<user>/Dropbox/Blog/site/out
    chmod 0755 $(find . -type d)
    echo all done
After saving, give it the right permissions and try it out
    chmod +x publish.sh
    ./publish.sh
This copies all files under ~/Dropbox/Blog/publish into the appropriate directory and rebuilds the site.
Now we use incron so that all of this happens automatically.
    sudo apt-get update
    sudo apt-get install incron
    sudo service incron start
    incrontab -e
What you see now is the incrontab, which works much like the crontab, except that tasks are not run at a given time, but as soon as events occur in the file system. Entries have the format
    <path> <mask> <command>
<path> specifies the path to watch, and <command> the command that is executed as soon as the event specified by <mask> occurs. To react to changes, we enter the following:
    /home/<user>/Dropbox/Blog/publish/ IN_MODIFY,IN_DELETE,IN_CLOSE_WRITE,IN_MOVE /home/<user>/Dropbox/Blog/scripts/publish.sh
With that, the site should be regenerated on every change in the publish directory, and our work is done.
What is still missing, of course, is adjusting the web server configuration. Since I strongly assume that everyone can manage that (there are countless tutorials on the topic), I will not explain it in detail here. All you really need to do is serve the directory ~/Dropbox/Blog/site/out through the web server.
I have also put my publish folder under version control with Git (you can find it among my public repositories), but that is of course not mandatory.

The new website is here
As you can probably see, our updated website is now up & running. We are quite proud of it as it is a lighter, cleaner and overall better looking version of the previous one.
For this project, we decided to use a bunch of new technologies in order to make it easier to maintain & deploy. We have also open sourced the entire website code on Github, just in case you want to use pieces or even the entire backend logic for your own projects!