
Scraping Buzzfeed with wget
For our first assignment I scraped the website Buzzfeed with the command-line utility wget. First, I tried the -m (--mirror) option, which recursively mirrors the host and, by default, does not follow links to other hosts. The command was:
wget -m www.buzzfeed.com
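To make clear what -m switches on, here is a small dry-run sketch. The wget manual describes -m as currently equivalent to -r -N -l inf --no-remove-listing, written below in long form. The variable names and the --wait=1 delay are my own assumptions for illustration and were not part of the command I ran.

```shell
#!/bin/sh
# Long-form spelling of `wget -m`, as listed in the wget manual.
MIRROR_OPTS="--recursive --timestamping --level=inf --no-remove-listing"

# Assumption: --wait=1 pauses one second between requests to be polite.
# It was not part of the original command.
MIRROR_CMD="wget $MIRROR_OPTS --wait=1 www.buzzfeed.com"

# Dry run: print the command instead of executing it.
echo "$MIRROR_CMD"
```

Running the printed command by hand would start the actual mirror. Printing first is just a cheap way to check the options before hitting the site.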
The mirror run produced an index.html file that quite literally mirrored the site. You can see a side-by-side comparison here:
Next, I tried downloading the entire news section of Buzzfeed. To accomplish this, I used the command:
wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains buzzfeed.com --no-parent buzzfeed.com/news
This produced another HTML file, giving me the entire Buzzfeed news section.
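Since that command is long, here is a hedged dry-run sketch that lists each option on its own line with a short note on what it does. The add_opt helper and the variable names are hypothetical, made up for this example. I also added a trailing slash to the URL as an assumption: the wget manual notes that --no-parent only treats "news" as a directory when the URL ends in a slash, so without it the crawl may not stay inside the news section.

```shell
#!/bin/sh
# Dry-run sketch: collect the options one per line so each can carry a comment.
OPTS=""
add_opt() { OPTS="$OPTS $1"; }   # hypothetical helper, not part of wget

add_opt --recursive                    # follow links found in downloaded pages
add_opt --no-clobber                   # skip files that already exist locally
add_opt --page-requisites              # also fetch images, stylesheets, etc. needed to display pages
add_opt --html-extension               # save pages with an .html suffix (newer wget calls this --adjust-extension)
add_opt --convert-links                # rewrite links so the local copy can be browsed offline
add_opt --restrict-file-names=windows  # escape characters that Windows file names disallow
add_opt --domains=buzzfeed.com         # only follow links on buzzfeed.com hosts
add_opt --no-parent                    # never ascend above the starting directory

# Assumption: trailing slash added so wget treats "news" as a directory.
NEWS_CMD="wget$OPTS buzzfeed.com/news/"

# Print the assembled command instead of running it.
echo "$NEWS_CMD"
```

The printed line can be pasted into a terminal to run the real download.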