
Data Collection: APIs and Web Scraping in Python

Nicole Donnelly

TechLady Hackathon 4 logo created by @MarieCWhittaker

Hi.

Everything for this workshop is available online here: https://github.com/nd1/tlh4_workshop

Who Are You?

How many people know Python? How many people have used APIs? Has anybody ever tried web scraping?

Who Am I?

I have been using Python for a little over a year. I am a recovering consultant. I am learning to be a data scientist. I really love data.

I know a lot of people, especially here, probably love data. But I really love data.

Data is my happy place.

Some people nurture plants, or animals, or children.

I want to nurture data.

I want it to become the best data it can be. I want it to help people. I want it to be happy.

In my past life as a consultant, I spent a lot of time with data too. I worked in computer forensics and electronic discovery. I collected data, inventoried it, organized it, gave it context, analyzed it, reconstructed it, reported on it, and put it in useful formats for people to review and use. I was frequently asked to help inventory and organize data for other projects because I am good at it and I enjoy it.

I decided to make a career change to data science. I completed the professional certificate in data science at Georgetown and I am currently the TA. I also completed the Data Science Immersive at General Assembly.

Currently seeking full-time employment!

Agenda

An Overview

What are APIs?
Where are APIs?
How do I access APIs?
What is the API giving me?
What is web scraping?
When should I web scrape?
How do I web scrape?

Some Hands-on

Workshop Notebook
Python Scripts

What are APIs?

Application Programming Interface

“In the simplest terms, APIs are sets of requirements that govern how one application can talk to another.”

“APIs are what make it possible to move information between programs....”

Source: http://readwrite.com/2013/09/19/api-defined

Where are APIs?

[Image: apis]

How do I access APIs?

Context: I will focus on REST APIs with JSON.

Read more about API types here

http://dev.markitondemand.com/Api/Quote/json?symbol=AAPL

{"Data":{"Status":"SUCCESS","Name":"Apple Inc","Symbol":"AAPL","LastPrice":117.12,"Change":-0.349999999999994,"ChangePercent":-0.297948412360598,"Timestamp":"Wed Oct 19 00:00:00 UTC-04:00 2016","MarketCap":631094444160,"Volume":20034594,"ChangeYTD":105.26,"ChangePercentYTD":11.2673380201406,"High":117.76,"Low":113.8,"Open":117.25}}

http://api.dp.la/v2/items?api_key=0123456789&q=goats+AND+cats

{"count":29,
"start":0,
"limit":10,
"docs":[{"@context":"http://dp.la/api/items/context","isShownAt":"http://cdm16795.contentdm.oclc.org/cdm/ref/collection/divtour/id/88","dataProvider":"Missouri State Archives through Missouri Digital Heritage","@type":"ore:Aggregation","provider":{"@id":"http://dp.la/api/contributor/missouri-hub","name":"Missouri Hub"},"hasView":{"@id":"http://cdm16795.contentdm.oclc.org/cdm/ref/collection/divtour/id/88"},"object":"http://data.mohistory.org/files/thumbnails/cdm16795_contentdm_oclc_org568ad334407e0.jpg","ingestionSequence":12,"id":"9e05f398ca95f9bbfd733e6d3493fd74","ingestDate":"2016-10-11T13:21:48.399681Z","_rev":"7-6bee4d18708d1d16efceeea1e061b316","aggregatedCHO":"#sourceResource","_id":"missouri--urn:data.mohistory.org:mdh_all:oai:cdm16795.contentdm.oclc.org:divtour/88","sourceResource":{"title":["Alabama Big Cats Safari Adventure"],"description":["Children bottle feeding goats"],"subject":[{"name":"Transparencies, Slides"},{"name":"Tourist Destination"}],"rights":["Copyright is in the public domain. 
Items reproduced for publication should carry the credit line: Courtesy of the Missouri State Archives."],"relation":["Division of Tourism Photograph Collection"],"language":[{"iso639_3":"eng","name":"English"}],"format":"Image","collection":{"id":"594a2b3666ab0c55245f6640555554cd","description":"","title":"Mdh_divtour","@id":"http://dp.la/api/collections/594a2b3666ab0c55245f6640555554cd"},"stateLocatedIn":[{"name":"Missouri"}],"@id":"http://dp.la/api/items/9e05f398ca95f9bbfd733e6d3493fd74#sourceResource","identifier":["001_070","http://cdm16795.contentdm.oclc.org/cdm/ref/collection/divtour/id/88"],"creator":["GD"]},"admin":{"validation_message":null,"sourceResource":{"title":"Alabama Big Cats Safari Adventure"},"valid_after_enrich":true},"ingestType":"item","@id":"http://dp.la/api/items/9e05f398ca95f9bbfd733e6d3493fd74","originalRecord":{"id":"urn:data.mohistory.org:mdh_all:oai:cdm16795.contentdm.oclc.org:divtour/88","provider":{"@id":"http://dp.la/api/contributor/missouri-hub","name":"Missouri Hub"},"collection":{"id":"594a2b3666ab0c55245f6640555554cd","description":"","title":"mdh_divtour","@id":"http://dp.la/api/collections/594a2b3666ab0c55245f6640555554cd"},"header":{"expirationdatetime":"2016-10-08T17:04:17Z","datestamp":"2016-10-04T13:19:05Z","identifier":"urn:data.mohistory.org:mdh_all:oai:cdm16795.contentdm.oclc.org:divtour/88","setSpec":"mdh_divtour"},"metadata":{"mods":{"accessCondition":"Copyright is in the public domain. 
Items reproduced for publication should carry the credit line: Courtesy of the Missouri State Archives.","location":{"url":[{"#text":"http://cdm16795.contentdm.oclc.org/cdm/ref/collection/divtour/id/88","access":"object in context"},{"#text":"http://data.mohistory.org/files/thumbnails/cdm16795_contentdm_oclc_org568ad334407e0.jpg","access":"preview"}]},"subject":[{"topic":"Transparencies, Slides"},{"topic":"Tourist Destination"}],"name":{"namePart":"GD","role":{"roleTerm":"creator"}},"relatedItem":{"titleInfo":{"title":"Division of Tourism Photograph Collection"}},"physicalDescription":{"note":"Image"},"xmlns":"http://www.loc.gov/mods/v3","language":{"languageTerm":"eng"},"titleInfo":{"title":"Alabama Big Cats Safari Adventure"},"identifier":["001_070","http://cdm16795.contentdm.oclc.org/cdm/ref/collection/divtour/id/88"],"note":["Children bottle feeding goats",{"#text":"Missouri State Archives through Missouri Digital Heritage","type":"ownership"}]}}},"score":4.534843}, ... 
"facets":[]}
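
A minimal sketch of how the two request URLs above could be assembled and fetched in Python 3, using only the standard library. The helper names `build_url` and `fetch_json` are hypothetical, the API key is the placeholder from the slide, and either service may have changed since this workshop, so the sketch builds the URLs but does not call them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_url(base_url, **params):
    """Append URL-encoded query parameters to a base endpoint (hypothetical helper)."""
    return base_url + "?" + urlencode(params)


def fetch_json(url, timeout=10):
    """Send a GET request and decode the JSON body. Needs network access."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# The endpoint is the part before "?"; everything after it is parameters.
quote_url = build_url("http://dev.markitondemand.com/Api/Quote/json", symbol="AAPL")

# "0123456789" is the placeholder key from the slide, not a working key.
dpla_url = build_url("http://api.dp.la/v2/items", api_key="0123456789", q="goats AND cats")

# To make a live request (assuming the service still responds):
# quote = fetch_json(quote_url)
```

The `requests` library offers the same idea through `requests.get(url, params=...)`; the standard library is used here only so the sketch has no dependencies.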

What is the API giving me?

http://dev.markitondemand.com/Api/Quote/json?symbol=AAPL

{"Data":{"Status":"SUCCESS","Name":"Apple Inc","Symbol":"AAPL","LastPrice":117.06,"Change":-0.0600000000000023,"ChangePercent":-0.0512295081967233,"Timestamp":"Thu Oct 20 00:00:00 UTC-04:00 2016","MarketCap":630771137580,"Volume":24059570,"ChangeYTD":105.26,"ChangePercentYTD":11.2103363100893,"High":117.38,"Low":116.33,"Open":116.86}}

http://dev.markitondemand.com/Api/Quote/xml?symbol=AAPL

<QuoteApiModel>
<Data>
<Status>SUCCESS</Status>
<Name>Apple Inc</Name>
<Symbol>AAPL</Symbol>
<LastPrice>117.06</LastPrice>
<Change>-0.06</Change>
<ChangePercent>-0.0512295082</ChangePercent>
<Timestamp>Thu Oct 20 00:00:00 UTC-04:00 2016</Timestamp>
<MarketCap>630771137580</MarketCap>
<Volume>24059570</Volume>
<ChangeYTD>105.26</ChangeYTD>
<ChangePercentYTD>11.2103363101</ChangePercentYTD>
<High>117.38</High>
<Low>116.33</Low>
<Open>116.86</Open>
</Data>
</QuoteApiModel>
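
The same quote arrives in two formats, and Python can read both with the standard library. The sketch below uses trimmed copies of the two responses above (only four fields are kept) to show one practical difference: JSON numbers come back as numbers, while XML values come back as text.

```python
import json
import xml.etree.ElementTree as ET

# Trimmed copies of the JSON and XML responses shown above.
json_text = '{"Data":{"Status":"SUCCESS","Name":"Apple Inc","Symbol":"AAPL","LastPrice":117.06}}'
xml_text = (
    "<QuoteApiModel><Data>"
    "<Status>SUCCESS</Status><Name>Apple Inc</Name>"
    "<Symbol>AAPL</Symbol><LastPrice>117.06</LastPrice>"
    "</Data></QuoteApiModel>"
)

# JSON maps directly onto Python dictionaries, lists, strings, and numbers.
quote_from_json = json.loads(json_text)["Data"]

# XML is parsed into a tree of elements; every value is a string.
data_element = ET.fromstring(xml_text).find("Data")
quote_from_xml = {child.tag: child.text for child in data_element}

print(type(quote_from_json["LastPrice"]))  # float
print(type(quote_from_xml["LastPrice"]))   # str, convert with float() if needed
```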

What is web scraping?

Web Scraping goes by many names - Screen Scraping, Web Data Extraction, Web Harvesting
It is the process of directly accessing webpages to copy data when APIs are not provided
It automates the process of copying the data and saving it locally

It can be a violation of terms of service
It can have legal repercussions
Even when it isn't a violation or illegal, it is ethically ambiguous
It can look like a denial-of-service (DoS) attack and can overwhelm resources, making them unavailable to others
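
One common way to avoid overwhelming a site is to pause between requests. The sketch below is an illustration under assumptions, not a guarantee of good behavior: `fetch_politely` is a hypothetical helper, the one-second default is an arbitrary choice, and `fetch` stands in for whatever function actually downloads a page.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL in order, sleeping between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


# Stand-in for a real download function, so the example runs offline.
def fake_fetch(url):
    return "contents of " + url


pages = fetch_politely(["page-1", "page-2"], fake_fetch, delay_seconds=0.1)
```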

When should I web scrape?

"IF YOU NEED A SCRAPER, YOU HAVE A DATA PROBLEM."

David Eads, PyData DC, 10/9/16

NEVER SCRAPE UNLESS YOU...

have no other way of liberating the data
budget appropriately
consider the ethical ramifications (do some googling and read about the ethics)
read terms of service and do your legal research
talk to a lawyer (if you possibly can)

[Image: meme]

How do I web scrape?

There are a lot of libraries available in Python to help with this task:

urllib2 - Python 2 module for opening URLs; split into urllib.request and urllib.error in Python 3
requests - third-party HTTP library with a simpler interface than urllib/urllib2
lxml - extensive library for parsing XML and HTML; some people find it a little more difficult to use initially
beautifulsoup4 - library for pulling data out of HTML and XML files

There is usually more than one way to do things in Python. What you end up using tends to be what you learn first and/or what you are most comfortable with.

Expect a fair bit of trial and error. Web scraping is not straightforward and not for the faint of heart. It is a last resort.
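
To give a feel for the parsing step, here is a minimal sketch that pulls every link out of a page. It uses the standard library's `html.parser` rather than the libraries listed above, purely so it runs without installing anything; beautifulsoup4 or lxml would replace the hand-written `LinkCollector` class. The class name and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Made-up page standing in for HTML you would download first.
sample_html = """
<html><body>
  <h1>Reports</h1>
  <ul>
    <li><a href="/reports/2015.csv">2015</a></li>
    <li><a href="/reports/2016.csv">2016</a></li>
  </ul>
</body></html>
"""

collector = LinkCollector()
collector.feed(sample_html)
links = collector.links
print(links)
```

Real pages are rarely this tidy, which is where the trial and error comes in.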

Resources

Command Line

Command Line Crash Course - for *nix-based users (Linux/OS X)
Effective Windows PowerShell - Windows users

Git and GitHub

There are a lot of online resources for using Git and GitHub. Here is a good one to start with: An Intro to Git and GitHub for Beginners (Tutorial).

Python

There are a lot of Python references available. I used Learn Python the Hard Way and Automate the Boring Stuff with Python when I was first learning. I like the challenges on HackerRank for practice. The discussion on each challenge is great for seeing how other people approach the problem and for asking questions. The free beginner lessons on DataQuest are also great if you are just starting out or want more practice.

More Web Scraping

Ethics: I encourage you to google the ethics of web scraping and read a bit about it before plunging into scraping.

Check out this GitHub Repo with a tutorial for writing your first scraper. Jackie Kazil, of PyLadies DC and Women Data Scientists DC, contributed to the project.

Scrapy is a Python library that provides an application framework for scraping.

In June, David Eads from the NPR Visuals Team published a blog post on Useful Scraping Techniques. It is pretty amazing. He presented it at PyData DC, so check back at that link for a recording of his talk (recordings should be posted by Thanksgiving). In the meantime, you can check out his slides.

You can find my initial DOH scraping efforts on my GitHub. I am attempting to build a sustainable scraper through Code for DC. If you want to help, come to a meetup.

Presentation

I created this presentation using Jupyter Notebook and reveal.js. It is being hosted as a GitHub Project Page. I found a few resources out there for how to do this, such as Presentation slides with Jupyter Notebook and Deploy reveal.js slideshow on github-pages.

Stay in touch!

Twitter: @NicoleADonnelly
GitHub: nd1
LinkedIn: nicoleadonnelly
Email: Nicole Donnelly