How to do a crawling test using wget
At work, before deploying any major version of our site, we run a crawling test to check that all links work as expected. Some applications, such as KLinkStatus, make this easy, but in this post I will explain how to check all the links on a website using only wget. It is as simple as running the following command:
wget -r -S http://yoursite.com 2>&1 | tee /tmp/crawlingTest
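The -r option turns on recursive retrieval, and -S prints the headers sent by the HTTP servers. Because wget writes its log to standard error, 2>&1 redirects it into the pipe so that tee can both display it and save it to the file "/tmp/crawlingTest". Note that recursion stops at depth 5 by default; add -l inf if you want wget to follow links to any depth. While the crawl runs, you can filter the log file in real time to spot Internal Server Errors (500), Not Found pages (404), Gone pages (410), or any other status you care about. E.g.: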
tail -f /tmp/crawlingTest | grep "HTTP/1\.[01] 500" # see Internal Server Errors (500)
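Once the crawl has finished, the same log can be summarized instead of watched live. The following is a minimal sketch rather than part of the workflow described above: the function names count_statuses and list_failures are hypothetical, and it assumes the log has the usual wget layout, where each request starts with a line such as "--2013-05-01 10:00:00--  http://yoursite.com/page" and each status line printed by -S looks like "  HTTP/1.1 404 Not Found".

```shell
# Hypothetical helpers for a log produced by "wget -r -S ... 2>&1 | tee LOG".
# Assumes the usual wget log layout; adjust the patterns if yours differs.

# Print how many times each HTTP status code appears, most frequent first.
count_statuses() {
  awk '$1 ~ /^HTTP\// && $2 ~ /^[0-9][0-9][0-9]$/ { print $2 }' "$1" |
    sort | uniq -c | sort -rn
}

# Print "STATUS URL" for every response with a 4xx or 5xx status.
list_failures() {
  awk '
    /^--[0-9]/ { url = $NF }   # request line: remember the URL (last field)
    $1 ~ /^HTTP\// && $2 ~ /^[0-9][0-9][0-9]$/ && $2 + 0 >= 400 { print $2, url }
  ' "$1"
}
```

For example, running count_statuses /tmp/crawlingTest gives a quick overview of the crawl, and list_failures /tmp/crawlingTest shows which URLs need attention.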
But what if you need authentication? Recursive wget also works on pages protected by basic authentication if you add the --user and --password options. E.g.:
wget -r -S --user=johndoe --password=mypassword yoursite.com 2>&1 | tee /tmp/crawlingTest
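If you would rather not type the credentials every time, they can be read from environment variables set elsewhere, for example by a CI system. This is only a sketch: crawl_test, CRAWL_USER and CRAWL_PASSWORD are names made up for this example, not wget features. The password is still handed to wget as a command-line argument, so other users on the machine may be able to see it in the process list while the crawl runs.

```shell
# Hypothetical wrapper around the commands shown in this post.
# Usage: crawl_test URL [LOGFILE]
# If CRAWL_USER is set (an assumed variable name), basic authentication
# is enabled with CRAWL_USER / CRAWL_PASSWORD; otherwise the crawl is anonymous.
crawl_test() {
  url=$1
  log=${2:-/tmp/crawlingTest}
  if [ -n "${CRAWL_USER:-}" ]; then
    wget -r -S --user="$CRAWL_USER" --password="${CRAWL_PASSWORD:-}" "$url" 2>&1 | tee "$log"
  else
    wget -r -S "$url" 2>&1 | tee "$log"
  fi
}
```

With both variables set in the environment, crawl_test http://yoursite.com behaves like the authenticated command above; without them it behaves like the first command in this post.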
Why use wget instead of a dedicated link checker? With wget you don't need to download, install, and configure any new software; it is simple to use; it runs from the command line, so you don't need a window system; and it consumes fewer resources than most applications designed for this purpose.
For more information, take a look at the GNU Wget Manual and this thread on Stack Overflow.
















