How to find a list of old URLs for any domain using the Wayback CDX API

By David Sottimano, October 27, 2017

How many times have we had to correct a bad migration? Don’t you wish you could get a list of the old URLs? If the domain in question didn’t block the Internet Archive bot (User Agent: archive.org_bot), then you might be in luck.

Note: Patrick Stox previously documented this on Search Engine Land, and his write-up is recommended as supplementary reading.

The intelligent folks at the Wayback Machine have the CDX server API up and running (beta). Let’s look back at Seomoz.org from 2007 to 2011 for a second. Visit this URL in your browser: http://web.archive.org/cdx/search/cdx?url=seomoz.org&matchType=domain&fl=original&collapse=urlkey&limit=20000&output=json&from=2007&to=2011&filter=statuscode:200
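
If you would rather build that URL in a script than by hand, here is a minimal Python sketch. The function name build_cdx_url and its arguments are my own choices for illustration, not part of the CDX API; the parameter names and values are copied from the URL above.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_url(domain, year_from, year_to, limit=20000):
    """Assemble the same query string as the example URL (hypothetical helper)."""
    params = [
        ("url", domain),
        ("matchType", "domain"),
        ("fl", "original"),
        ("collapse", "urlkey"),
        ("limit", limit),
        ("output", "json"),
        ("from", year_from),
        ("to", year_to),
        ("filter", "statuscode:200"),
    ]
    # safe=":" leaves the colon in the filter value unescaped so it stays readable.
    return CDX_ENDPOINT + "?" + urlencode(params, safe=":")


print(build_cdx_url("seomoz.org", 2007, 2011))
```

Swapping in another domain or year range is then a one-line change.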


Let’s take my query apart, one parameter at a time.

1) MatchType

Here, I’m requesting all pages from the “domain” with &matchType=domain, which means any page on the seomoz.org domain, subdomains included. For full reference go to https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md#url-match-scope
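
As a rough sketch, the four match scopes listed in the README (exact, prefix, host, domain) can be kept in a small lookup so a script rejects typos before sending a request. The descriptions are my own paraphrase of the README, and scoped_query is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Scopes as I understand them from the CDX server README (paraphrased).
MATCH_TYPES = {
    "exact": "only the exact URL given (the default)",
    "prefix": "every URL under the given path",
    "host": "every URL on the given host",
    "domain": "every URL on the host and its subdomains",
}


def scoped_query(url, match_type="exact"):
    """Return the url/matchType part of a CDX query string (hypothetical helper)."""
    if match_type not in MATCH_TYPES:
        raise ValueError(f"unknown matchType: {match_type!r}")
    return urlencode({"url": url, "matchType": match_type})


for name, meaning in MATCH_TYPES.items():
    print(scoped_query("seomoz.org", name), "->", meaning)
```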

2) Limit

My query asks for up to 20,000 rows with &limit=20000. The maximum here is 150000, so you’ll have to get used to offset and pagination past this.

Reference: https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md#query-result-limits
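
One way to page with limit and offset is sketched below. It assumes a hypothetical fetch_page(limit, offset) callable that performs the HTTP request and returns only the data rows (no JSON header row); that callable and the max_pages safety cap are my own additions, not part of the API.

```python
def fetch_all(fetch_page, page_size=150000, max_pages=100):
    """Collect rows by advancing the offset until a short page comes back.

    fetch_page(limit, offset) is a hypothetical callable supplied by you;
    it should return a list of data rows for that window.
    """
    rows = []
    for page in range(max_pages):
        batch = fetch_page(limit=page_size, offset=page * page_size)
        rows.extend(batch)
        if len(batch) < page_size:
            break  # fewer rows than requested means this was the last page
    return rows


# Demonstration with an in-memory stand-in for the real request.
fake_rows = [f"http://example.com/page-{i}" for i in range(25)]
result = fetch_all(lambda limit, offset: fake_rows[offset:offset + limit], page_size=10)
print(len(result))
```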

3) Output

Here I’m just making it easier to parse by spitting out JSON with &output=json. Not a requirement, but JSON is my preference.

Reference: https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md#output-format-json
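
To show why JSON is easy to parse, here is a small sketch. As far as I can tell from the README, the JSON output is an array of arrays whose first row names the fields. The sample body below is made up for illustration, and parse_cdx_json is a hypothetical helper.

```python
import json


def parse_cdx_json(body):
    """Turn a CDX JSON response body into a list of dicts keyed by field name."""
    rows = json.loads(body) if body.strip() else []
    if not rows:
        return []
    header, *records = rows  # the first row holds the field names
    return [dict(zip(header, record)) for record in records]


# Made-up sample of what a response with fl=original might look like.
sample = '[["original"], ["http://www.seomoz.org/"], ["http://www.seomoz.org/blog"]]'
for record in parse_cdx_json(sample):
    print(record["original"])
```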

5) From (date)

Here I’m limiting results to captures made between 2007 and 2011 with &from=2007&to=2011.
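
The from and to values do not have to be bare years. My understanding of the README is that they accept 1 to 14 digits in yyyyMMddhhmmss form, so a tighter range can be sketched like this (cdx_timestamp and date_range_params are hypothetical helper names):

```python
from datetime import datetime


def cdx_timestamp(moment):
    """Format a datetime as a 14-digit yyyyMMddhhmmss string."""
    return moment.strftime("%Y%m%d%H%M%S")


def date_range_params(start, end):
    """Build the from/to query parameters for a capture date range."""
    return {"from": cdx_timestamp(start), "to": cdx_timestamp(end)}


params = date_range_params(datetime(2007, 1, 1), datetime(2011, 12, 31, 23, 59, 59))
print(params)
```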


6) Filter

Here I’m only returning results that responded with a 200 response code, using &filter=statuscode:200. If I wanted to reverse this, I’d put an exclamation mark in front of the field name: &filter=!statuscode:200.

Full reference: https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md#filtering
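
The README gives the filter form as [!]field:regex and, as I read it, allows the filter parameter to be repeated. A minimal sketch of building one or more filters follows; cdx_filter is a hypothetical helper, and the mimetype filter is just an example of a second condition.

```python
from urllib.parse import urlencode


def cdx_filter(field, regex, invert=False):
    """Build one filter value in the README's [!]field:regex form."""
    return ("!" if invert else "") + f"{field}:{regex}"


# A list of tuples lets the same parameter name appear more than once.
query = urlencode(
    [
        ("url", "seomoz.org"),
        ("filter", cdx_filter("statuscode", "200")),
        ("filter", cdx_filter("mimetype", "text/html")),
    ],
    safe=":!/",  # keep filter punctuation readable instead of percent-encoded
)
print(query)
print(cdx_filter("statuscode", "200", invert=True))
```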

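7) Show only URLs

Add the following two parameters to suppress extra output: &fl=original&collapse=urlkey. The first returns only the original URL field, and the second collapses repeated captures of the same URL into a single row. Contributed by Ryan Siddle.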
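
To make the effect of those two parameters concrete, here is a local simulation in Python. It is only a sketch of my understanding of the behaviour (adjacent rows with the same urlkey are collapsed to the first one, then only the original field is kept); the real work happens on the server, and the sample rows are made up.

```python
from itertools import groupby


def collapse_and_project(rows, collapse_field="urlkey", keep_field="original"):
    """Keep the first row of each run of equal collapse_field, then one field."""
    return [
        next(group)[keep_field]
        for _, group in groupby(rows, key=lambda row: row[collapse_field])
    ]


# Made-up rows, sorted by urlkey the way the server returns them.
rows = [
    {"urlkey": "org,seomoz)/", "timestamp": "20070101000000", "original": "http://www.seomoz.org/"},
    {"urlkey": "org,seomoz)/", "timestamp": "20080101000000", "original": "http://seomoz.org/"},
    {"urlkey": "org,seomoz)/blog", "timestamp": "20090101000000", "original": "http://www.seomoz.org/blog"},
]
print(collapse_and_project(rows))
```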

Consider donating if you appreciate the Internet Archive 

David Sottimano

About David Sottimano

Trying to make OpensourceSeo.org the best free information hub for the SEO industry.