Matthew Henderson

college station, tx

Search the NY Times Archive

Jul 24, 2014

UPDATED: This article now includes example code for searching The Guardian's archive through their API. That code can be found at the bottom of this article.

The Background

I needed to search the NY Times archives for articles matching a search term within a defined time period. Thankfully, the Times provides an excellent (and free) resource through version 2 of their Article Search API that makes this possible.

The first thing you will want to do is review their terms of use and apply for an API key here: http://developer.nytimes.com. Below, I have included the code I used to retrieve my results, which you might find helpful in performing your own searches. You will need to enter your API key value before running the script (look for apikey near the very end of the code).



What it Does

Running the Ruby code below will prompt you to enter:

  • the term you want to search on
  • the begin and end dates in YYYY-MM-DD format

An appropriately named CSV output file for the results is created in the folder where the script is run, and some basic progress information is displayed in the terminal as the search runs.
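One caveat: a search term containing spaces or punctuation should be URL-encoded before it goes into a query string. A minimal sketch using Ruby's standard library (the example term is made up):

```ruby
require 'cgi'

## URL-encode a user-supplied search term before placing it in a
## query string; CGI.escape handles spaces and reserved characters.
searchterm = "same-sex marriage"
encoded = CGI.escape(searchterm)
puts encoded ## => "same-sex+marriage"
```

Spaces become "+" and reserved characters become %XX escapes, so the resulting request URL stays valid.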

Four pieces of information from the returned search results are saved to the CSV file: publication date, source, headline, and URL. I have noted in the code some other values that are available in the results; they should be fairly simple to add to the output file.
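A note on the CSV writing: the script quotes headlines by hand, which can break if a headline itself contains a double quote or a field contains a comma. Ruby's standard CSV library handles the escaping for you; a sketch with made-up field values:

```ruby
require 'csv'

## CSV.generate_line quotes and escapes fields as needed, so
## embedded quotes and commas in headlines are safe.
row = ["2014-07-24", "The New York Times",
       'A "Quoted" Headline', "http://example.com/article"]
line = CSV.generate_line(row)
## embedded quotes are doubled and the field is wrapped in quotes
```

Swapping this in for the manual string concatenation would make the output robust against odd headlines.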


IMPORTANT

1) Be sure to replace the “apikey” value in the code below with your own key; otherwise, you will get no results. You may obtain a key and review the terms of use here: http://developer.nytimes.com

2) The example code requires the httparty gem, which you can read more about here: https://github.com/jnunemaker/httparty. Install it with “sudo gem install httparty”.

3) PLEASE NOTE that each run will overwrite any existing file with the same name. The default output name is set in the code; currently it follows this format: “nytimes-SEARCHTERM-BEGINDATE-ENDDATE.csv”
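If overwriting is a concern, one simple option is to append a run timestamp to the output name. A hypothetical variant of the naming (results_filename is my own helper, not part of the script below):

```ruby
## Append a run timestamp so repeated runs with the same search
## parameters never overwrite an earlier results file.
def results_filename(searchterm, begindate, enddate, now = Time.now)
  stamp = now.strftime("%Y%m%d-%H%M%S")
  "nytimes-#{searchterm}-#{begindate}-#{enddate}-#{stamp}.csv"
end
```

Swap this into the line that sets @resultsfile if you want that behavior.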


The Code

## NY Times Archive Headline Search:
## saves results in CSV file
##  (date | source | headline | url)

require 'rubygems'
require 'httparty'
require 'json' ## for JSON.parse
require 'date' ## for Date.parse


def getnumberhits(searchterm,searchdatestart,searchdateend)
  url = "#{@url}&q=#{searchterm}&begin_date=#{searchdatestart}&end_date=#{searchdateend}"
  begin
    data = getdata(url)
    hits = (data["meta"]["hits"]).to_i
  rescue
    hits = 0
    puts "Processing #{searchdatestart} : No Results"
  end
  
  puts "Processing #{searchdatestart}-#{searchdateend} : #{hits}"
  if hits > 1000
    puts ">>>> WARNING <<<< : Results for this period exceed 1000. Process in smaller segments."
  end
  
  pages = (hits / 10.0).ceil ## 10 results per page; round up for the last partial page
  
  ## pages are zero-indexed (page 0 is the first page of results)
  (0..pages-1).each do |page|
    puts "processing page: #{page+1} of #{pages}"
    getarticles(searchterm,searchdatestart,searchdateend,page)
    sleep(2) ## pause between page requests
  end
end


def getarticles(searchterm,searchdatestart,searchdateend,page)
  url = "#{@url}&q=#{searchterm}&begin_date=#{searchdatestart}&end_date=#{searchdateend}&page=#{page}&sort=oldest"
  data = getdata(url)

  articles = data["docs"] ## array
  articles.each do |a|
    ## AVAILABLE VALUES (some are hashes):
    ## headline, keywords, pub_date, abstract, web_url,
    ## byline, source, section_name, subsection_name
    ## print_page, snippet, lead_paragraph, blog, multimedia, 
    ## document_type, news_desk, type_of_material, _id, word_count
    headline = a["headline"]["main"]
    weburl = a["web_url"]
    source = a["source"]
    pubdate = a["pub_date"].split("T")[0]
    ## double any embedded quotes so the quoted headline stays valid CSV
    line = "#{pubdate},#{source}," + '"' + headline.gsub('"', '""') + '",' + weburl
    File.open(@resultsfile, 'a') {|f| f.write("#{line}\n") }
  end
end


def getdata(url)
  response = HTTParty.get(url)
  data = JSON.parse(response.body)["response"]
  return data
end


def processdatesforarticles(apikey)
  ## set base API URL
  @url = "http://api.nytimes.com/svc/search/v2/articlesearch.json?api-key=#{apikey}"

  ## get input for searchterm, and begin/end dates
  print "Search term: "
  searchterm = gets.strip
  print "Begin date [yyyy-mm-dd]: "
  begindate = gets.strip
  print "End date [yyyy-mm-dd]: "
  enddate = gets.strip

  ## initialize results file
  @resultsfile = "nytimes-#{searchterm}-#{begindate}-#{enddate}.csv"
  File.open(@resultsfile, 'w') {|f| f.write("date,source,headline,url\n")}
  
  ## cycle through dates one month at a time
  btime = (Date.parse(begindate))
  etime = (Date.parse(enddate))
  
  while btime <= etime
    searchdatestart = btime.to_s.gsub("-","")
    ## if less than one month left in search period, 
    ## set end date to user selected end date
    if (btime >> 1) <= etime
      ## subtract one day so the first day of next month
      ## is not requested twice (this request's end date
      ## and next requests begin date). 
      searchdateend = ((btime >> 1) - 1).to_s.gsub("-","")
    else
      searchdateend = etime.to_s.gsub("-","")
    end
    
    getnumberhits(searchterm,searchdatestart,searchdateend)
    btime = btime >> 1 ## advance by one month
    sleep(10) ## pause between monthly requests
  end
end


## ***********************
## EXECUTE
## ***********************

apikey = "ENTER YOUR KEY HERE"
processdatesforarticles(apikey)
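The month-by-month loop above is the trickiest part of the script. Factored out as a standalone helper (month_windows is my own name, not in the script), the windowing logic looks like this, and you can check the boundaries it produces without hitting the API:

```ruby
require 'date'

## Split [begindate, enddate] into consecutive windows of at most
## one month, with no day shared between adjacent windows.
def month_windows(begindate, enddate)
  windows = []
  btime = Date.parse(begindate)
  etime = Date.parse(enddate)
  while btime <= etime
    ## end the window the day before the next month's start date,
    ## unless the overall end date comes first
    windowend = [(btime >> 1) - 1, etime].min
    windows << [btime.to_s, windowend.to_s]
    btime = btime >> 1 ## advance by one month
  end
  windows
end
```

For example, month_windows("2014-01-15", "2014-03-10") yields [["2014-01-15", "2014-02-14"], ["2014-02-15", "2014-03-10"]]: each day in the range is requested exactly once.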

The Very Basics

You probably don’t need this, but in case you do (everyone needs some help getting started at some point), here is a simple step-by-step for running a search with my code:

  1. Open your terminal and type “ruby -v”. If it returns a version of at least 1.9.3, you are ready to continue. If it returns 1.8.7, you will need to upgrade your Ruby version (I recommend looking into rbenv)
  2. Install the required gem by running “sudo gem install httparty” in the terminal
  3. Copy the code above and save it as a file named “search-nytimes-articles.rb”
  4. Get your own API key from the NY Times developer website: http://developer.nytimes.com
  5. Update the “apikey” value at the bottom of the code with your own API key
  6. In your terminal, change directories to the path where the “search-nytimes-articles.rb” file resides
  7. Now, run “ruby search-nytimes-articles.rb”
  8. It will prompt you for the input it needs and save the results in a new file in the same location using the naming format mentioned above in the IMPORTANT section


BONUS: Searching The Guardian

The code below might be helpful if you want to search the Guardian using their API (the new version is currently in beta).

First, you will need to visit their open platform website and register for an API key: http://www.theguardian.com/open-platform.

The code below is based on the NY Times code above but has been altered for use with the Guardian’s API.
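One difference worth noting before the code: the NY Times endpoint expects dates as YYYYMMDD, while the Guardian expects YYYY-MM-DD, which is why the Guardian version drops the gsub calls. For instance:

```ruby
require 'date'

## The same date formatted for each API's query parameters.
d = Date.parse("2014-07-24")
nytimes_style  = d.to_s.gsub("-", "") ## "20140724"
guardian_style = d.to_s               ## "2014-07-24"
```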

##  Guardian Archive Headline Search:
##  saves results in CSV file
##  (date | section | headline | url)
##*************************************

require 'rubygems'
require 'httparty'
require 'json' ## for JSON.parse
require 'date' ## for Date.parse


def getnumberhits(searchterm,searchdatestart,searchdateend)
  url = "#{@url}&q=#{searchterm}&from-date=#{searchdatestart}&to-date=#{searchdateend}&page-size=100"
  begin
    data = getdata(url)
    pages = (data["pages"]).to_i
    hits = (data["total"])
  rescue
    pages = 0
    hits = 0
    puts "Processing #{searchdatestart} : No Results"
  end
  
  puts "Processing #{searchdatestart}-#{searchdateend} : #{hits} hits, and #{pages} pages."
  (1..pages).each do |page|
    puts "processing page: #{page} of #{pages}"
    getarticles(searchterm,searchdatestart,searchdateend,page)
    sleep(2) ## pause between page requests
  end
end


def getarticles(searchterm,searchdatestart,searchdateend,page)
  url = "#{@url}&q=#{searchterm}&from-date=#{searchdatestart}&to-date=#{searchdateend}&page-size=100&page=#{page}&order-by=oldest"
  data = getdata(url)

  articles = data["results"] ## array
  articles.each do |a|
    ## AVAILABLE VALUES (some are hashes):
    ## webTitle, webPublicationDate, sectionName
    ## sectionId, id, webUrl, apiUrl
    headline = a["webTitle"]
    weburl = a["webUrl"]
    source = a["sectionName"]
    pubdate = a["webPublicationDate"].split("T")[0]
    ## double any embedded quotes so the quoted headline stays valid CSV
    line = "#{pubdate},#{source}," + '"' + headline.gsub('"', '""') + '",' + weburl
    File.open(@resultsfile, 'a') {|f| f.write("#{line}\n") }
  end
end


def getdata(url)
  response = HTTParty.get(url)
  data = JSON.parse(response.body)["response"]
  return data
end


def processdatesforarticles(apikey)
  ## set base API URL
  @url = "http://beta.content.guardianapis.com/search?api-key=#{apikey}"

  ## get input for searchterm, and begin/end dates
  print "Search term: "
  searchterm = gets.strip
  print "Begin date [yyyy-mm-dd]: "
  begindate = gets.strip
  print "End date [yyyy-mm-dd]: "
  enddate = gets.strip

  ## initialize results file
  @resultsfile = "guardian-#{searchterm}-#{begindate}-#{enddate}.csv"
  File.open(@resultsfile, 'w') {|f| f.write("date,source,headline,url\n")}
  
  ## cycle through dates one month at a time
  btime = (Date.parse(begindate))
  etime = (Date.parse(enddate))
  
  while btime <= etime
    searchdatestart = btime.to_s
    ## if less than one month left in search period, 
    ## set end date to user selected end date
    if (btime >> 1) <= etime
      ## subtract one day so the first day of next month
      ## is not requested twice (this request's end date
      ## and next requests begin date). 
      searchdateend = ((btime >> 1) - 1).to_s
    else
      searchdateend = etime.to_s
    end
    
    getnumberhits(searchterm,searchdatestart,searchdateend)
    btime = btime >> 1 ## advance by one month
    sleep(10) ## pause between monthly requests
  end
end


## ***********************
## EXECUTE
## ***********************

apikey = "YOUR API KEY"
processdatesforarticles(apikey)