Forays in Web Crawling

April 16th, 2013

I've always been really interested in the idea of gathering data from the internet. It probably stems from my love of organizing, classifying, and recording stuff. I finally took an opportunity to exercise this interest this week.

On the forums for the (amazing) game Dwarf Fortress, there is a thread dedicated to illustrating "forgotten beasts" from the game. Forgotten beasts are terrifying monsters that are randomly generated by the game and lurk in the depths of the earth. Because forgotten beasts are rarely alike, the thread has seen a great variety of works from some outstanding artists. I wanted to collect these illustrations and organize them into an archive where people could browse them by artist or by the beast.

To collect what ended up being over 650 images from a 150-page forum thread, I created a web robot in C#. This part of the task provided some interesting challenges. The robot had to collect the URLs of any images posted in the thread along with their authors, as well as the descriptions of the beasts being illustrated, which were sometimes posted as text and sometimes as screenshots from the game. Fortunately, the descriptions follow an identifiable formula, which made it possible to pick them out with good success from all the other text in the thread.

To associate the beast pictures with the proper descriptions, and to find and transcribe the descriptions that were embedded in screenshots, I turned to a sort of "crowdsourcing" model: I set up a simple PHP page that shows images to viewers and asks them to classify them. This was an easy way to finish tasks that would have been impossible for an automated program to do (though I contemplated incorporating some kind of OCR to look for text descriptions in the images and transcribe them).
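For readers curious what the scraping step might look like, here is a minimal sketch in Python (the actual robot was written in C#, and this is not its code). It pulls author/image pairs out of one page of forum HTML and applies a rough test for beast descriptions. The markup it expects (an element with class "poster" holding the author's name before each post) and the "Beware its" pattern are both assumptions made for illustration; the real forum markup and the formula I matched against may differ.

```python
import re
from html.parser import HTMLParser


class ThreadImageParser(HTMLParser):
    """Collects (author, image_url) pairs from one page of forum HTML.

    Assumes hypothetical markup in which an element with class "poster"
    containing the author's name appears before each post body.
    """

    def __init__(self):
        super().__init__()
        self.images = []          # list of (author, image_url) tuples
        self._author = None       # most recent author name seen
        self._in_poster = False   # True while waiting for the author's name

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "poster" in classes:
            self._in_poster = True
        elif tag == "img" and attrs.get("src"):
            # Credit the image to whoever posted most recently.
            self.images.append((self._author, attrs["src"]))

    def handle_data(self, data):
        # The first non-blank text inside a "poster" element is the name.
        if self._in_poster and data.strip():
            self._author = data.strip()
            self._in_poster = False


def looks_like_description(text):
    """Rough heuristic (an assumed pattern, not the original formula):
    the text opens with "A ..." or "An ..." and contains "Beware its"."""
    text = text.strip()
    return bool(re.match(r"An? \w", text)) and "Beware its" in text


if __name__ == "__main__":
    sample_page = """
    <div class="poster"><a href="#">Urist</a></div>
    <div class="post">Here it is. <img src="http://example.com/beast1.png"></div>
    """
    parser = ThreadImageParser()
    parser.feed(sample_page)
    print(parser.images)
```

A real crawler would also need to fetch each of the thread's pages and skip images that are not illustrations (avatars, smileys, signatures), which this sketch does not attempt.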

To check out some of the great art this thread has produced, visit the archive page: https://www.brianmacintosh.com/beasts/
