Everything for this workshop is available online here
Nicole Donnelly, District Data Labs
“In the simplest terms, APIs are sets of requirements that govern how one application can talk to another.”
“APIs are what make it possible to move information between programs....”
“RSS is an XML-based vocabulary for distributing Web content in opt-in feeds. Feeds allow the user to have new content delivered to a computer or mobile device as soon as it is published. ”
Source: http://searchwindevelopment.techtarget.com/definition/RSS
There are different versions and related formats: RSS 1.0, RSS 2.0, and Atom.
Additional information on RSS: https://www.mnot.net/rss/tutorial/
"RSS is in some sense a specific API in its own right – with specific semantics for calls (the “latest” X items on this topic) and standard formats for the data being returned. This makes the content from this API accessible to many thousands of reader implementations."
Context: I will focus on RSS and REST APIs with JSON.
Read more about API types here
http://dvd.netflix.com/NewReleasesRSS
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" >
<channel >
<title>New Releases This Week</title>
<ttl>10080</ttl>
<link>http://dvd.netflix.com</link>
<description>New movies at Netflix this week</description>
<language>en-us</language>
<cf:treatAs xmlns:cf="http://www.microsoft.com/schemas/rss/core/2005">list</cf:treatAs>
<atom:link href="http://dvd.netflix.com/NewReleasesRSS" rel="self" type="application/rss+xml"/>
<item>
<title>Bakery in Brooklyn</title>
<link>https://dvd.netflix.com/Movie/Bakery-in-Brooklyn/80152426</link>
<guid isPermaLink="true">https://dvd.netflix.com/Movie/Bakery-in-Brooklyn/80152426</guid>
<description><![CDATA[<a href="https://dvd.netflix.com/Movie/Bakery-in-Brooklyn/80152426"><img src="//secure.netflix.com/us/boxshots/small/80152426.jpg"/></a><br>Vivien and Chloe have just inherited their Aunt's bakery, a boulangerie that has been a cornerstone of the neighborhood for years. Chloe wants a new image and product, while Vivien wants to make sure nothing changes. Their clash of ideas leads to a peculiar solution, they split the shop in half. But Vivien and Chloe will have to learn to overcome their differences in order to save the bakery and everything that truly matters in their lives.]]></description>
</item>
</channel>
</rss>
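Because RSS is plain XML, the feed above can be parsed with nothing but the Python standard library. A minimal sketch, using an abbreviated copy of the feed (one item only) so it runs without a network connection:

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the Netflix "New Releases" feed shown above.
rss = """<rss version="2.0">
  <channel>
    <title>New Releases This Week</title>
    <link>http://dvd.netflix.com</link>
    <item>
      <title>Bakery in Brooklyn</title>
      <link>https://dvd.netflix.com/Movie/Bakery-in-Brooklyn/80152426</link>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(rss)
channel = root.find("channel")
print(channel.findtext("title"))  # feed-level title
for item in channel.iter("item"):
    # Each <item> is one piece of content in the feed.
    print(item.findtext("title"), "->", item.findtext("link"))
```

In practice you would fetch the feed first (for example with `requests.get`) and pass `response.text` to `ET.fromstring`; libraries like `feedparser` also handle the different RSS/Atom dialects for you.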
http://dev.markitondemand.com/Api/Quote/json?symbol=AAPL
{"Data":{"Status":"SUCCESS","Name":"Apple Inc","Symbol":"AAPL","LastPrice":117.12,"Change":-0.349999999999994,"ChangePercent":-0.297948412360598,"Timestamp":"Wed Oct 19 00:00:00 UTC-04:00 2016","MarketCap":631094444160,"Volume":20034594,"ChangeYTD":105.26,"ChangePercentYTD":11.2673380201406,"High":117.76,"Low":113.8,"Open":117.25}}
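A JSON response like the one above maps directly onto Python dictionaries. This sketch parses a trimmed copy of the captured response body; in a live session you would call `requests.get(url)` and use `response.json()` instead (note that this particular Markit On Demand endpoint may no longer be available):

```python
import json

# A trimmed copy of the JSON body returned by the quote endpoint above.
body = ('{"Data":{"Status":"SUCCESS","Name":"Apple Inc","Symbol":"AAPL",'
        '"LastPrice":117.12,"High":117.76,"Low":113.8,"Open":117.25}}')

quote = json.loads(body)["Data"]  # json.loads turns the text into a dict
print(quote["Name"], quote["Symbol"], quote["LastPrice"])
```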
A response of 200 OK means that the request was successful. Other common status codes include:
A complete list can be found here: HTTP Statuses.
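When you call an API from Python, you can branch on the numeric status code before trying to parse the body. A small lookup helper (the descriptions are my own shorthand, not official reason phrases):

```python
def describe_status(code):
    """Return a short description for the most common HTTP status codes."""
    common = {
        200: "OK - the request was successful",
        301: "Moved Permanently - follow the new URL",
        400: "Bad Request - check your query parameters",
        401: "Unauthorized - missing or invalid API key",
        403: "Forbidden - you may not access this resource",
        404: "Not Found - check the endpoint path",
        429: "Too Many Requests - slow down and respect rate limits",
        500: "Internal Server Error - try again later",
    }
    return common.get(code, "see the full HTTP status list")

print(describe_status(200))  # OK - the request was successful
```

With `requests`, the code to check would be `response.status_code`, and `response.raise_for_status()` will raise an exception for any 4xx/5xx response.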
http://api.dp.la/v2/items?api_key=0123456789&q=goats+AND+cats
{"count":29,
"start":0,
"limit":10,
"docs":[{"@context":"http://dp.la/api/items/context","isShownAt":"http://cdm16795.contentdm.oclc.org/cdm/ref/collection/divtour/id/88","dataProvider":"Missouri State Archives through Missouri Digital Heritage","@type":"ore:Aggregation","provider":{"@id":"http://dp.la/api/contributor/missouri-hub","name":"Missouri Hub"},"hasView":{"@id":"http://cdm16795.contentdm.oclc.org/cdm/ref/collection/divtour/id/88"},"object":"http://data.mohistory.org/files/thumbnails/cdm16795_contentdm_oclc_org568ad334407e0.jpg","ingestionSequence":12,"id":"9e05f398ca95f9bbfd733e6d3493fd74","ingestDate":"2016-10-11T13:21:48.399681Z","_rev":"7-6bee4d18708d1d16efceeea1e061b316","aggregatedCHO":"#sourceResource","_id":"missouri--urn:data.mohistory.org:mdh_all:oai:cdm16795.contentdm.oclc.org:divtour/88","sourceResource":{"title":["Alabama Big Cats Safari Adventure"],"description":["Children bottle feeding goats"],"subject":[{"name":"Transparencies, Slides"},{"name":"Tourist Destination"}],"rights":["Copyright is in the public domain. 
Items reproduced for publication should carry the credit line: Courtesy of the Missouri State Archives."],"relation":["Division of Tourism Photograph Collection"],"language":[{"iso639_3":"eng","name":"English"}],"format":"Image","collection":{"id":"594a2b3666ab0c55245f6640555554cd","description":"","title":"Mdh_divtour","@id":"http://dp.la/api/collections/594a2b3666ab0c55245f6640555554cd"},"stateLocatedIn":[{"name":"Missouri"}],"@id":"http://dp.la/api/items/9e05f398ca95f9bbfd733e6d3493fd74#sourceResource","identifier":["001_070","http://cdm16795.contentdm.oclc.org/cdm/ref/collection/divtour/id/88"],"creator":["GD"]},"admin":{"validation_message":null,"sourceResource":{"title":"Alabama Big Cats Safari Adventure"},"valid_after_enrich":true},"ingestType":"item","@id":"http://dp.la/api/items/9e05f398ca95f9bbfd733e6d3493fd74","originalRecord":{"id":"urn:data.mohistory.org:mdh_all:oai:cdm16795.contentdm.oclc.org:divtour/88","provider":{"@id":"http://dp.la/api/contributor/missouri-hub","name":"Missouri Hub"},"collection":{"id":"594a2b3666ab0c55245f6640555554cd","description":"","title":"mdh_divtour","@id":"http://dp.la/api/collections/594a2b3666ab0c55245f6640555554cd"},"header":{"expirationdatetime":"2016-10-08T17:04:17Z","datestamp":"2016-10-04T13:19:05Z","identifier":"urn:data.mohistory.org:mdh_all:oai:cdm16795.contentdm.oclc.org:divtour/88","setSpec":"mdh_divtour"},"metadata":{"mods":{"accessCondition":"Copyright is in the public domain. 
Items reproduced for publication should carry the credit line: Courtesy of the Missouri State Archives.","location":{"url":[{"#text":"http://cdm16795.contentdm.oclc.org/cdm/ref/collection/divtour/id/88","access":"object in context"},{"#text":"http://data.mohistory.org/files/thumbnails/cdm16795_contentdm_oclc_org568ad334407e0.jpg","access":"preview"}]},"subject":[{"topic":"Transparencies, Slides"},{"topic":"Tourist Destination"}],"name":{"namePart":"GD","role":{"roleTerm":"creator"}},"relatedItem":{"titleInfo":{"title":"Division of Tourism Photograph Collection"}},"physicalDescription":{"note":"Image"},"xmlns":"http://www.loc.gov/mods/v3","language":{"languageTerm":"eng"},"titleInfo":{"title":"Alabama Big Cats Safari Adventure"},"identifier":["001_070","http://cdm16795.contentdm.oclc.org/cdm/ref/collection/divtour/id/88"],"note":["Children bottle feeding goats",{"#text":"Missouri State Archives through Missouri Digital Heritage","type":"ownership"}]}}},"score":4.534843}, ...
"facets":[]}
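The DPLA request above is just a base URL plus query parameters, so you can build it safely with `urllib.parse.urlencode` instead of string concatenation (the key shown is the placeholder from the slide, not a real key):

```python
from urllib.parse import urlencode

# urlencode handles the escaping (spaces become '+') for us.
params = {"api_key": "0123456789", "q": "goats AND cats"}
url = "http://api.dp.la/v2/items?" + urlencode(params)
print(url)
```

With `requests` you can skip the manual step entirely: `requests.get("http://api.dp.la/v2/items", params=params)` encodes the query string the same way.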
For the purposes of this tutorial, you will see information in the notebook on creating a JSON file to hold your keys. If you follow a naming convention for your API key files, you can easily add them to your .gitignore file to avoid inadvertently committing them to git.
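A minimal sketch of that round trip, assuming a hypothetical naming convention like `*_api_key.json` (which is what you would list in .gitignore); the file is written to a temp directory here purely for demonstration:

```python
import json
import os
import tempfile

# Hypothetical key file following a "*_api_key.json" naming convention.
keyfile = os.path.join(tempfile.mkdtemp(), "dpla_api_key.json")
with open(keyfile, "w") as f:
    json.dump({"api_key": "0123456789"}, f)

# In the notebook you would only do this part: load the key at run time
# so it never appears in your committed code.
with open(keyfile) as f:
    api_key = json.load(f)["api_key"]
print(api_key)
```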
http://dev.markitondemand.com/Api/Quote/json?symbol=AAPL
{"Data":{"Status":"SUCCESS","Name":"Apple Inc","Symbol":"AAPL","LastPrice":117.06,"Change":-0.0600000000000023,"ChangePercent":-0.0512295081967233,"Timestamp":"Thu Oct 20 00:00:00 UTC-04:00 2016","MarketCap":630771137580,"Volume":24059570,"ChangeYTD":105.26,"ChangePercentYTD":11.2103363100893,"High":117.38,"Low":116.33,"Open":116.86}}
http://dev.markitondemand.com/Api/Quote/xml?symbol=AAPL
<QuoteApiModel>
<Data>
<Status>SUCCESS</Status>
<Name>Apple Inc</Name>
<Symbol>AAPL</Symbol>
<LastPrice>117.06</LastPrice>
<Change>-0.06</Change>
<ChangePercent>-0.0512295082</ChangePercent>
<Timestamp>Thu Oct 20 00:00:00 UTC-04:00 2016</Timestamp>
<MarketCap>630771137580</MarketCap>
<Volume>24059570</Volume>
<ChangeYTD>105.26</ChangeYTD>
<ChangePercentYTD>11.2103363101</ChangePercentYTD>
<High>117.38</High>
<Low>116.33</Low>
<Open>116.86</Open>
</Data>
</QuoteApiModel>
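The same quote in XML takes a little more work than the JSON version: every value comes back as text and must be converted by hand. A sketch parsing a trimmed copy of the response above:

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the XML body returned by the quote endpoint above.
xml_body = """<QuoteApiModel>
  <Data>
    <Status>SUCCESS</Status>
    <Symbol>AAPL</Symbol>
    <LastPrice>117.06</LastPrice>
  </Data>
</QuoteApiModel>"""

data = ET.fromstring(xml_body).find("Data")
# Unlike json.loads, ElementTree gives us strings, so convert explicitly.
last_price = float(data.findtext("LastPrice"))
print(data.findtext("Symbol"), last_price)
```

This is one practical reason to prefer the JSON variant of an endpoint when an API offers both.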
The robots exclusion standard, also known as the robots exclusion protocol or simply robots.txt, is a standard used by websites to communicate with web crawlers and other web robots. The standard specifies how to inform the web robot about which areas of the website should not be processed or scanned. Robots are often used by search engines to categorize web sites. Not all robots cooperate with the standard; email harvesters, spambots, malware, and robots that scan for security vulnerabilities may even start with the portions of the website where they have been told to stay out. The standard is different from, but can be used in conjunction with, Sitemaps, a robot inclusion standard for websites.
Source: Wikipedia
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
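Python can check these rules for you before you scrape: the standard library's `urllib.robotparser` answers "may I fetch this path?" directly. A sketch that feeds it the rules above (normally you would call `set_url(".../robots.txt")` and `read()` against the live site):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse the example robots.txt rules directly to stay offline.
rp.parse("""User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/""".splitlines())

print(rp.can_fetch("*", "/tmp/cache.html"))  # disallowed by the rules
print(rp.can_fetch("*", "/about.html"))      # not matched, so allowed
```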
There are a lot of libraries available in Python to help with this task:
There is usually more than one way to do things in Python. What you end up with tends to be what you learned first and/or what you are most comfortable with.
Expect a fair bit of trial and error. Web scraping is not straightforward and not for the faint of heart. It is a last resort.
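Whichever library you pick (BeautifulSoup and lxml are popular choices), the core task is the same: walk the HTML and pull out the pieces you need. A dependency-free sketch using the standard library's `html.parser`, collecting every link on a page:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag - a typical first scraping step."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

parser = LinkExtractor()
parser.feed('<p>See <a href="/docs">the docs</a> and <a href="/faq">FAQ</a>.</p>')
print(parser.links)  # ['/docs', '/faq']
```

Real pages are messier than this snippet, which is exactly why the trial and error mentioned above is unavoidable.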
Friday 4:30 p.m.–5 p.m.
Building A Gigaword Corpus: Lessons on Data Ingestion, Management, and Processing for NLP
Saturday 2:35 p.m.–3:05 p.m.
Human-Machine Collaboration for Improved Analytical Processes
Sunday Morning, Expo Hall
On the Hour Data Ingestion from the Web to a Mongo Database
Model Management Systems: Scikit-Learn and Django
Yellowbrick: Steering Scikit-Learn with Visual Transformers
A Framework for Exploratory Data Analysis with Python