
How to find a list of old URLs for any domain using the Wayback CDX API

By David Sottimano, October 27, 2017

How many times have we had to correct a bad migration? Don’t you wish you could get a list of the old URLs? If the domain in question didn’t block the Internet Archive bot (User Agent: archive.org_bot), then you might be in luck.

Note: This was previously documented by Patrick Stox on Search Engine Land, and his article is recommended as supplementary reading.

The intelligent folks at the Wayback Machine have the CDX server API up and running (beta). Let’s look back at Seomoz.org from 2007 to 2011 for a second. Visit this URL in your browser: http://web.archive.org/cdx/search/cdx?url=seomoz.org&matchType=domain&fl=original&collapse=urlkey&limit=20000&output=json&from=2007&to=2011&filter=statuscode:200


Let’s take my query apart, one parameter at a time:

1) MatchType

http://web.archive.org/cdx/search/cdx?url=seomoz.org&matchType=domain

Here, I’m requesting all pages from the “domain” scope, which means any page on the seomoz.org domain. For the full reference, go to https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md#url-match-scope
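The same scoped request can be built in code rather than by hand. Here is a minimal Python sketch using only the standard library. The helper name scoped_query is hypothetical, and the four scope values are the ones I understand the README linked above to list, so check them there before relying on this.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

# Scope values as described in the CDX server README (assumption: this list is complete).
MATCH_TYPES = ("exact", "prefix", "host", "domain")


def scoped_query(target, match_type="domain"):
    """Return a CDX query URL for `target` limited to the given match scope."""
    if match_type not in MATCH_TYPES:
        raise ValueError(f"unknown matchType: {match_type!r}")
    return CDX_ENDPOINT + "?" + urlencode({"url": target, "matchType": match_type})


print(scoped_query("seomoz.org"))
```

Swapping "domain" for "prefix" or "host" narrows the scope without touching the rest of the query.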

2) Limit

&limit=20000

The maximum here is 150000, so for anything bigger you’ll have to get used to using offset and pagination to fetch the results beyond that.

Reference: https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md#query-result-limits
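To give an idea of what paging with limit and offset might look like, here is a rough Python sketch. It assumes the offset parameter skips that many results, as I understand the README to describe. The function name paged_queries and the page count are made up for illustration. In practice you would keep requesting pages until one comes back empty.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def paged_queries(target, page_size, pages):
    """Yield one query URL per page, advancing `offset` by `page_size` each time."""
    for page in range(pages):
        params = {
            "url": target,
            "matchType": "domain",
            "limit": page_size,
            "offset": page * page_size,  # assumption: offset counts skipped results
        }
        yield CDX_ENDPOINT + "?" + urlencode(params)


for url in paged_queries("seomoz.org", page_size=20000, pages=3):
    print(url)
```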

3) Output

&output=json

Here I’m just making it easier to parse by spitting out JSON. Not a requirement, but JSON is my preference.

Reference: https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md#output-format-json
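The JSON output is an array of arrays in which the first row names the fields, which makes it simple to turn into records. Below is a small Python sketch of that parsing step. The SAMPLE string is made-up data shaped like a response, not real API output, and rows_to_dicts is a hypothetical helper name.

```python
import json

# Made-up sample shaped like an output=json response: the first row names the fields.
SAMPLE = (
    '[["original", "statuscode"],'
    ' ["http://www.seomoz.org/", "200"],'
    ' ["http://www.seomoz.org/blog", "200"]]'
)


def rows_to_dicts(body):
    """Turn the CDX JSON array-of-arrays into a list of dicts keyed by field name."""
    rows = json.loads(body) if body.strip() else []
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


for row in rows_to_dicts(SAMPLE):
    print(row["original"], row["statuscode"])
```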

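4) From and to (date)

&from=2007&to=2011

Here I’m limiting results to captures made from 2007 through 2011. Both values are timestamps, and a partial timestamp such as a year on its own is accepted.

5) Filter

&filter=statuscode:200

Here I’m only returning results that responded with a 200 response code. To reverse this, I’d put an exclamation mark in front of the field name: &filter=!statuscode:200.

Full reference: https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md#filtering

6) Show only URLs

Add the following two parameters to suppress extra output: &fl=original&collapse=urlkey. The first returns only the original URL field, and the second collapses repeat captures of the same URL into a single row. Contributed by Ryan Siddle.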
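Putting the pieces together, here is a minimal Python sketch that rebuilds the example query from the top of this post and, if you want, fetches it. It uses only the standard library. The function names build_query and fetch_urls are hypothetical, and the fetch step assumes the JSON response starts with a header row followed by one URL per row.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_query(target, start, end, status="200", limit=20000):
    """Assemble the full query from this post as a URL string."""
    params = [
        ("url", target),
        ("matchType", "domain"),
        ("fl", "original"),
        ("collapse", "urlkey"),
        ("limit", limit),
        ("output", "json"),
        ("from", start),
        ("to", end),
        ("filter", f"statuscode:{status}"),
    ]
    # safe=":" leaves the colon unescaped so the filter reads statuscode:200
    return CDX_ENDPOINT + "?" + urlencode(params, safe=":")


def fetch_urls(query_url):
    """Download the JSON result and return the URLs, skipping the header row."""
    with urlopen(query_url, timeout=60) as response:
        body = response.read().decode("utf-8")
    rows = json.loads(body) if body.strip() else []
    return [row[0] for row in rows[1:]]


query = build_query("seomoz.org", 2007, 2011)
print(query)
# Call fetch_urls(query) to perform the actual network request.
```

The network call is left out of the default run so that the script can be checked offline first. Calling fetch_urls(query) should return a plain list of URLs ready to feed into a redirect map or crawler.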

Consider donating if you appreciate the Internet Archive 


About David Sottimano

Founder of OpensourceSeo.org. Strategist at definemg.com
