
Data Collection: APIs and Web Scraping in Python


Nicole Donnelly

TechLady Hackathon 4 logo created by @MarieCWhittaker

Hi.

Everything for this workshop is available online at https://github.com/nd1/tlh4_workshop

Who Are You?

How many people know Python?
How many people have used APIs?
Anybody ever try web scraping?

Who Am I?

I have been using Python for a little over a year.
I am a recovering consultant.
I am learning to be a data scientist.
I really love data.

Agenda

An Overview

  • What are APIs?
  • Where are APIs?
  • How do I access APIs?
  • What is the API giving me?
  • What is web scraping?
  • When should I web scrape?
  • How do I web scrape?

Some Hands-on

  • Workshop Notebook
  • Python Scripts

What are APIs?

Application Programming Interface

“In the simplest terms, APIs are sets of requirements that govern how one application can talk to another.”

“APIs are what make it possible to move information between programs....”

Source: http://readwrite.com/2013/09/19/api-defined

Where are APIs?

apis

How do I access APIs?

Context: I will focus on REST APIs with JSON.

Read more about API types here

http://dev.markitondemand.com/Api/Quote/json?symbol=AAPL

{"Data":{"Status":"SUCCESS","Name":"Apple Inc","Symbol":"AAPL","LastPrice":117.12,"Change":-0.349999999999994,"ChangePercent":-0.297948412360598,"Timestamp":"Wed Oct 19 00:00:00 UTC-04:00 2016","MarketCap":631094444160,"Volume":20034594,"ChangeYTD":105.26,"ChangePercentYTD":11.2673380201406,"High":117.76,"Low":113.8,"Open":117.25}}
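The request above is just a URL: a base endpoint, a `?`, then `key=value` parameters joined by `&`. Here is a minimal sketch of building and requesting such URLs with only the Python 3 standard library. The helper names `build_url` and `fetch_json` are made up for illustration, and the API key is the placeholder from the slides, not a working key.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_url(base, params):
    """Join a base endpoint and a dict of query parameters into one URL."""
    # urlencode escapes the values; spaces become '+', as in the DPLA query.
    return base + "?" + urlencode(params)


def fetch_json(url, timeout=10):
    """Request the URL and decode the JSON body (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# The two query URLs used in this section.
quote_url = build_url("http://dev.markitondemand.com/Api/Quote/json",
                      {"symbol": "AAPL"})
dpla_url = build_url("http://api.dp.la/v2/items",
                     {"api_key": "0123456789", "q": "goats AND cats"})

print(quote_url)
print(dpla_url)
```

Calling `fetch_json(quote_url)` should return a Python dictionary, assuming the service is still online. The third-party requests library wraps the same steps as `requests.get(base, params=params).json()`.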

http://api.dp.la/v2/items?api_key=0123456789&q=goats+AND+cats

{"count":29,
"start":0,
"limit":10,
"docs":[{"@context":"http://dp.la/api/items/context","isShownAt":"http://cdm16795.contentdm.oclc.org/cdm/ref/collection/divtour/id/88","dataProvider":"Missouri State Archives through Missouri Digital Heritage","@type":"ore:Aggregation","provider":{"@id":"http://dp.la/api/contributor/missouri-hub","name":"Missouri Hub"},"hasView":{"@id":"http://cdm16795.contentdm.oclc.org/cdm/ref/collection/divtour/id/88"},"object":"http://data.mohistory.org/files/thumbnails/cdm16795_contentdm_oclc_org568ad334407e0.jpg","ingestionSequence":12,"id":"9e05f398ca95f9bbfd733e6d3493fd74","ingestDate":"2016-10-11T13:21:48.399681Z","_rev":"7-6bee4d18708d1d16efceeea1e061b316","aggregatedCHO":"#sourceResource","_id":"missouri--urn:data.mohistory.org:mdh_all:oai:cdm16795.contentdm.oclc.org:divtour/88","sourceResource":{"title":["Alabama Big Cats Safari Adventure"],"description":["Children bottle feeding goats"],"subject":[{"name":"Transparencies, Slides"},{"name":"Tourist Destination"}],"rights":["Copyright is in the public domain. 
Items reproduced for publication should carry the credit line: Courtesy of the Missouri State Archives."],"relation":["Division of Tourism Photograph Collection"],"language":[{"iso639_3":"eng","name":"English"}],"format":"Image","collection":{"id":"594a2b3666ab0c55245f6640555554cd","description":"","title":"Mdh_divtour","@id":"http://dp.la/api/collections/594a2b3666ab0c55245f6640555554cd"},"stateLocatedIn":[{"name":"Missouri"}],"@id":"http://dp.la/api/items/9e05f398ca95f9bbfd733e6d3493fd74#sourceResource","identifier":["001_070","http://cdm16795.contentdm.oclc.org/cdm/ref/collection/divtour/id/88"],"creator":["GD"]},"admin":{"validation_message":null,"sourceResource":{"title":"Alabama Big Cats Safari Adventure"},"valid_after_enrich":true},"ingestType":"item","@id":"http://dp.la/api/items/9e05f398ca95f9bbfd733e6d3493fd74","originalRecord":{"id":"urn:data.mohistory.org:mdh_all:oai:cdm16795.contentdm.oclc.org:divtour/88","provider":{"@id":"http://dp.la/api/contributor/missouri-hub","name":"Missouri Hub"},"collection":{"id":"594a2b3666ab0c55245f6640555554cd","description":"","title":"mdh_divtour","@id":"http://dp.la/api/collections/594a2b3666ab0c55245f6640555554cd"},"header":{"expirationdatetime":"2016-10-08T17:04:17Z","datestamp":"2016-10-04T13:19:05Z","identifier":"urn:data.mohistory.org:mdh_all:oai:cdm16795.contentdm.oclc.org:divtour/88","setSpec":"mdh_divtour"},"metadata":{"mods":{"accessCondition":"Copyright is in the public domain. 
Items reproduced for publication should carry the credit line: Courtesy of the Missouri State Archives.","location":{"url":[{"#text":"http://cdm16795.contentdm.oclc.org/cdm/ref/collection/divtour/id/88","access":"object in context"},{"#text":"http://data.mohistory.org/files/thumbnails/cdm16795_contentdm_oclc_org568ad334407e0.jpg","access":"preview"}]},"subject":[{"topic":"Transparencies, Slides"},{"topic":"Tourist Destination"}],"name":{"namePart":"GD","role":{"roleTerm":"creator"}},"relatedItem":{"titleInfo":{"title":"Division of Tourism Photograph Collection"}},"physicalDescription":{"note":"Image"},"xmlns":"http://www.loc.gov/mods/v3","language":{"languageTerm":"eng"},"titleInfo":{"title":"Alabama Big Cats Safari Adventure"},"identifier":["001_070","http://cdm16795.contentdm.oclc.org/cdm/ref/collection/divtour/id/88"],"note":["Children bottle feeding goats",{"#text":"Missouri State Archives through Missouri Digital Heritage","type":"ownership"}]}}},"score":4.534843}, ... 
"facets":[]}

What is the API giving me?

http://dev.markitondemand.com/Api/Quote/json?symbol=AAPL

{"Data":{"Status":"SUCCESS","Name":"Apple Inc","Symbol":"AAPL","LastPrice":117.06,"Change":-0.0600000000000023,"ChangePercent":-0.0512295081967233,"Timestamp":"Thu Oct 20 00:00:00 UTC-04:00 2016","MarketCap":630771137580,"Volume":24059570,"ChangeYTD":105.26,"ChangePercentYTD":11.2103363100893,"High":117.38,"Low":116.33,"Open":116.86}}
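Once the JSON text is in hand, Python's built-in `json` module turns it into dictionaries, lists, strings, and numbers. The sketch below parses the response shown above; the variable names are only for illustration.

```python
import json

# Response body copied from the quote API example above.
raw = ('{"Data":{"Status":"SUCCESS","Name":"Apple Inc","Symbol":"AAPL",'
       '"LastPrice":117.06,"Change":-0.0600000000000023,'
       '"ChangePercent":-0.0512295081967233,'
       '"Timestamp":"Thu Oct 20 00:00:00 UTC-04:00 2016",'
       '"MarketCap":630771137580,"Volume":24059570,"ChangeYTD":105.26,'
       '"ChangePercentYTD":11.2103363100893,"High":117.38,"Low":116.33,'
       '"Open":116.86}}')

response = json.loads(raw)  # JSON object -> Python dict
quote = response["Data"]    # the payload is nested under the "Data" key

# Numbers arrive as int or float, so arithmetic works right away.
print(quote["Name"], quote["LastPrice"])
print("Day range:", round(quote["High"] - quote["Low"], 2))
```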


http://dev.markitondemand.com/Api/Quote/xml?symbol=AAPL

<QuoteApiModel>
<Data>
<Status>SUCCESS</Status>
<Name>Apple Inc</Name>
<Symbol>AAPL</Symbol>
<LastPrice>117.06</LastPrice>
<Change>-0.06</Change>
<ChangePercent>-0.0512295082</ChangePercent>
<Timestamp>Thu Oct 20 00:00:00 UTC-04:00 2016</Timestamp>
<MarketCap>630771137580</MarketCap>
<Volume>24059570</Volume>
<ChangeYTD>105.26</ChangeYTD>
<ChangePercentYTD>11.2103363101</ChangePercentYTD>
<High>117.38</High>
<Low>116.33</Low>
<Open>116.86</Open>
</Data>
</QuoteApiModel>
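The XML version carries the same data but needs a different parser, and every value comes back as text. Here is a minimal sketch using the standard library's `xml.etree.ElementTree`, with the response above trimmed to a few fields to keep it short.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the XML response shown above.
raw_xml = """<QuoteApiModel>
<Data>
<Status>SUCCESS</Status>
<Name>Apple Inc</Name>
<Symbol>AAPL</Symbol>
<LastPrice>117.06</LastPrice>
<Volume>24059570</Volume>
</Data>
</QuoteApiModel>"""

root = ET.fromstring(raw_xml)  # root is the <QuoteApiModel> element
data = root.find("Data")       # first child element named <Data>

# Build a dict from tag name to text; unlike JSON, all values are strings.
quote = {child.tag: child.text for child in data}
last_price = float(quote["LastPrice"])  # convert explicitly before doing math

print(quote["Name"], last_price)
```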

What is web scraping?

  • Web scraping goes by many names: screen scraping, web data extraction, web harvesting
  • It is the process of directly accessing webpages to copy data when APIs are not provided
  • It automates the process of copying the data and saving it locally
  • It can be a violation of terms of service
  • It can have legal repercussions
  • Even when it isn't a violation or illegal, it is ethically ambiguous
  • It can look like a denial-of-service (DoS) attack and can overwhelm resources, making them unavailable to others
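One way to reduce the risk of overwhelming a site is to pause between requests and to honor the site's robots.txt file. The sketch below is an assumption about what "polite" scraping could look like, not a rule from this workshop: the helper names, the one-second default delay, and the sample robots.txt are all made up for illustration. Passing a robots.txt check is not a substitute for reading the terms of service.

```python
import time
from urllib import robotparser


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        pages.append(fetch(url))
    return pages


# Hypothetical robots.txt content and URLs, for illustration only.
sample_robots = "User-agent: *\nDisallow: /private/\n"
print(allowed_by_robots(sample_robots, "http://example.com/reports/1.html"))
print(allowed_by_robots(sample_robots, "http://example.com/private/1.html"))
```

`fetch_politely` takes the download function as an argument, so it can be tried with a stand-in function before pointing it at a real site.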

When should I web scrape?

"IF YOU NEED A SCRAPER, YOU HAVE A DATA PROBLEM."

David Eads, PyData DC, 10/9/16

NEVER SCRAPE UNLESS YOU...

  • have no other way of liberating the data
  • budget appropriately
  • consider the ethical ramifications (do some googling and read about the ethics)
  • read terms of service and do your legal research
  • talk to a lawyer (if you possibly can)


How do I web scrape?

There are a lot of libraries available in Python to help with this task:

  • urllib2 - Python 2 module for opening URLs; in Python 3 it was split into urllib.request and urllib.error
  • requests - third-party library with a simpler interface than urllib/urllib2
  • lxml - extensive library for parsing XML and HTML, some people find it a little more difficult to use initially
  • beautifulsoup4 - library for pulling data out of HTML and XML files

There is usually more than one way to do things in Python. What you end up using tends to be what you learned first and/or what you are most comfortable with.

Expect a fair bit of trial and error. Web scraping is not straightforward and not for the faint of heart. It is a last resort.
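To give a feel for the parsing step, here is a minimal sketch that pulls every link out of a page. It deliberately swaps in the standard library's `html.parser` instead of the libraries listed above so it runs without installing anything. The `LinkCollector` class and the sample HTML are hypothetical, made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (link text, href) pairs from every <a href=...> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href.
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((text, self._current_href))
            self._current_href = None


# Hypothetical page content; a real scraper would download this first.
sample_html = """
<html><body>
  <h1>Inspection reports</h1>
  <ul>
    <li><a href="/reports/1.html">Report 1</a></li>
    <li><a href="/reports/2.html">Report 2</a></li>
  </ul>
</body></html>
"""

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```

With beautifulsoup4 installed, roughly the same result comes from `BeautifulSoup(sample_html, "html.parser").find_all("a")`, which is a big part of why people reach for it.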

Resources

Command Line

Git and GitHub

There are a lot of online resources for using Git and GitHub. Here is a good one to start with: An Intro to Git and GitHub for Beginners (Tutorial).

Python

There are a lot of Python references available. I used Learn Python the Hard Way and Automate the Boring Stuff with Python when I was first learning. I like the challenges on HackerRank for practice. The discussion on each challenge is great for seeing how other people approach the problem and for asking questions. The free beginner lessons on DataQuest are also great if you are just starting out or want more practice.

More Web Scraping

Ethics: I encourage you to google the ethics of web scraping and read a bit about it before plunging into scraping.

Check out this GitHub Repo with a tutorial for writing your first scraper. Jackie Kazil, of PyLadies DC and Women Data Scientists DC, contributed to the project.

Scrapy is a Python library that provides an application framework for scraping.

In June, David Eads from the NPR Visuals Team published a blog post on Useful Scraping Techniques. It is pretty amazing. He presented it at PyData DC, so check back at that link for a recording of his talk (the recordings should be posted by Thanksgiving). In the meantime, you can check out his slides.

You can find my initial DOH scraping efforts on my GitHub. I am attempting to build a sustainable scraper through Code for DC. If you want to help, come to a meetup.

Presentation

I created this presentation using Jupyter Notebook and reveal.js. It is being hosted as a GitHub Project Page. I found a few resources out there for how to do this such as Presentation slides with Jupyter Notebook and Deploy reveal.js slideshow on github-pages.

Stay in touch!