How many times have we had to correct a bad migration? Don’t you wish you could get a list of URLs? If the domains in question didn’t block the IA bot (User Agent: archive.org_bot), then you might be in luck.
The intelligent folks at Wayback have the CDX server API up and running (Beta). Let’s look back at Seomoz.org from 2007 to 2011 for a second. Visit this URL in your browser: http://web.archive.org/cdx/search/cdx?url=seomoz.org&matchType=domain&fl=original&collapse=urlkey&limit=20000&output=json&from=2007&to=2011&filter=statuscode:200
Let’s take a look at my query and take it apart:
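If you would rather script this than paste it into a browser, here is a minimal sketch of assembling the same query string in Python. The helper name `build_cdx_url` and its arguments are my own for illustration, not part of the CDX API; only the query parameters come from the URL above.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_url(domain, year_from, year_to, limit=20000):
    """Assemble a CDX query URL (hypothetical helper, parameters as in the post)."""
    params = [
        ("url", domain),
        ("matchType", "domain"),
        ("fl", "original"),
        ("collapse", "urlkey"),
        ("limit", limit),
        ("output", "json"),
        ("from", year_from),
        ("to", year_to),
        ("filter", "statuscode:200"),
    ]
    # safe=":" keeps the colon in "statuscode:200" readable instead of %3A.
    return CDX_ENDPOINT + "?" + urlencode(params, safe=":")


print(build_cdx_url("seomoz.org", 2007, 2011))
```

Keeping the parameters in a list of pairs preserves their order, so the printed URL is the same one shown above.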
Here, with matchType=domain, I’m requesting all pages from the “domain”, which means any page on seomoz.org and its subdomains. For the full reference, go to https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md#url-match-scope
The limit parameter caps the number of results returned. The maximum here is 150000, so past that you’ll have to get used to offset and pagination.
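As a rough sketch of what offset-based pagination could look like, the loop below keeps asking for pages until one comes back short. `fetch_page` is a hypothetical callable that you would write yourself to request one page with the given offset and limit and return its records; the in-memory list here is only a stand-in so the loop can be run without touching the network.

```python
def fetch_all(fetch_page, page_size):
    """Yield every record by walking the result set one page at a time.

    fetch_page(offset, limit) is a hypothetical callable that returns a
    list of records for that window.
    """
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield from page
        # A short (or empty) page means there is nothing left to request.
        if len(page) < page_size:
            break
        offset += page_size


# Stand-in for the real API: five made-up records served from a list.
fake_records = ["url-%d" % i for i in range(5)]


def fake_fetch_page(offset, limit):
    return fake_records[offset:offset + limit]


print(list(fetch_all(fake_fetch_page, page_size=2)))
```

In a real run, `fetch_page` would add the offset and limit values to the query and strip the header row from each JSON response before returning the records.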
Here, with output=json, I’m just making it easier to parse by spitting out JSON. Not a requirement, but JSON is my preference.
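To show why JSON is handy here: the response is an array of rows, with the first row holding the field names. A minimal sketch of turning that into a list of dictionaries follows. The function name `rows_to_dicts` and the sample payload are made up for illustration; the payload is not real output from the API.

```python
import json


def rows_to_dicts(payload):
    """Convert a CDX JSON payload (header row + data rows) into dicts."""
    if not payload.strip():
        return []
    rows = json.loads(payload)
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


# Illustrative sample only, shaped like a response to a query with fl=original.
sample = '[["original"], ["http://www.seomoz.org/"], ["http://www.seomoz.org/blog"]]'

for row in rows_to_dicts(sample):
    print(row["original"])
```

Because the header row names the fields, the same function should work unchanged if you ask for more fields than just `original`.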
5) From (date)
Here I’m only returning results that responded with a 200 response code. If I wanted to reverse this, I’d put an exclamation mark in front of the field name: &filter=!statuscode:200.
7) Show only URLs
Add in the following two parameters to suppress extra output: &fl=original&collapse=urlkey. Contributed by Ryan Siddle.
Consider donating if you appreciate the Internet Archive