Scraping Buzzfeed with wget

For our first assignment I scraped the website Buzzfeed with the command-line utility wget. First, I tried the -m (--mirror) option, which recursively mirrors the host but does not follow links to external hosts. The command was:

wget -m www.buzzfeed.com

This produced an index.html file that quite literally mirrored the site. You can see a side-by-side comparison here:

index.html file from wget

Live site
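
As a side note on the -m option used for the mirror above: the wget manual describes it as shorthand for -r -N -l inf --no-remove-listing (recursion, time-stamping, unlimited depth, and keeping FTP listings). Here is a minimal sketch of that expanded form. The function name mirror_cmd is my own, and the command is only printed, not run:

```shell
# -m / --mirror is documented as equivalent to: -r -N -l inf --no-remove-listing
# mirror_cmd is a hypothetical helper name, not part of wget.
mirror_cmd() {
  # $1: host to mirror (www.buzzfeed.com in this post)
  # Only prints the command; remove `echo` to actually start the download.
  echo wget -r -N -l inf --no-remove-listing "$1"
}

mirror_cmd www.buzzfeed.com
```

Printing first makes it easy to check the options before starting what can be a very large download.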

Next, I tried downloading the entire news section of Buzzfeed. To do this I used the following command:

    wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains buzzfeed.com --no-parent buzzfeed.com/news

This produced another HTML file, giving me the entire Buzzfeed news section.
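
Since the news-section command packs a lot of options into one line, here is a rough sketch that spells them out one per line with a short note on each, based on the wget manual. The function name news_cmd is my own invention, and the sketch only prints the command instead of running it:

```shell
# What each option does, per the wget manual:
#   --recursive                    follow links and download what they point to
#   --no-clobber                   do not re-download files that already exist locally
#   --page-requisites              also fetch images, CSS, etc. needed to display a page
#   --html-extension               save HTML pages with an .html suffix
#                                  (newer wget versions call this --adjust-extension)
#   --convert-links                rewrite links so the local copy can be browsed offline
#   --restrict-file-names=windows  escape characters that are invalid in Windows file names
#   --domains buzzfeed.com         only follow links within the listed domain
#   --no-parent                    never ascend above the starting directory
# news_cmd is a hypothetical helper name, not part of wget.
news_cmd() {
  set -- \
    --recursive \
    --no-clobber \
    --page-requisites \
    --html-extension \
    --convert-links \
    --restrict-file-names=windows \
    --domains buzzfeed.com \
    --no-parent \
    buzzfeed.com/news
  # Only prints the command; remove `echo` to actually start the download.
  echo wget "$@"
}

news_cmd
```

Keeping one option per line makes it easier to drop or swap a flag when experimenting with what gets downloaded.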