Twitter Client
Welcome to part 2 of the OpenTechSchool Python Beginner series. In this part we will learn how to use Python to retrieve data from the internet.
If you participated in part 1 then you already have everything you need. This tutorial only requires an installation of Python (we use 2.7) and a web browser.
Back in the first session we introduced three of the most common data types used in programming: numbers, strings and booleans. We assigned those data types to variables one-by-one, like so:
>>> x = 3 # numbers
>>> a = "gorillas" # strings
>>> t = True # booleans
But what if we need something more complicated, like a shopping list? Assigning a variable for every item in the list would make things very complicated:
>>> item_1 = "milk"
>>> item_2 = "cheese"
>>> item_3 = "bread"
Fortunately we don't have to do this. Instead, we have the list data type. An empty list is simply []:
>>> shopping_list = []
When you are in the Python interpreter you can see what is inside a list by just typing the name of the list. For example:
>>> shopping_list
[]
The interpreter shows us that the list is empty.
Now we can add items to shopping_list. Try typing the following commands into the Python interpreter.
>>> shopping_list.append("milk")
>>> shopping_list.append("cheese")
>>> shopping_list.append("bread")
What is in the shopping list? What happens when you append numbers or booleans to the list?
To remove an item from the list we use remove():
>>> shopping_list.remove("milk")
Lists can easily be processed in a for loop. Have a look at this example, which prints each item of the list on its own line:
>>> for item in shopping_list:
...     print(item)
And that's it! Lists are the most common data structure in programming. There are lots of other things you can do with lists, and all languages have their own subtly different interpretation. But fundamentally they are all very similar.
In summary:
>>> shopping_list = []
>>> shopping_list.append("cookies")
>>> shopping_list.remove("cookies")
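Beyond append() and remove() there are a few more list operations worth knowing. Here is a short sketch; the values are just examples:

```python
shopping_list = ["milk", "cheese", "bread"]

print(len(shopping_list))  # number of items in the list: 3
print(shopping_list[0])    # access an item by position: 'milk'

shopping_list.sort()       # sort the list in place, alphabetically
print(shopping_list)       # ['bread', 'cheese', 'milk']
```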
The other main data type is the dictionary. The dictionary allows you to associate one piece of data with another. The analogy comes from real-life dictionaries, where we associate a word with its meaning. It's a little harder to understand than a list, but Python makes them very easy to deal with.
You can create a dictionary with {}
>>> foods = {}
And you can add items to the dictionary like this:
>>> foods["banana"] = "A delicious and tasty treat!"
>>> foods["dirt"] = "Not delicious. Not tasty. DO NOT EAT!"
As with lists, you can always see what is inside a dictionary:
>>> foods
{'banana': 'A delicious and tasty treat!', 'dirt': 'Not delicious. Not tasty. DO NOT EAT!'}
You can delete from a dictionary as well. We don't really need to include an entry for dirt:
>>> del foods["dirt"]
What makes dictionaries so useful is that we can give meaning to the items within them. A list is just a bag of things, but a dictionary is a specific mapping of something to something else. By combining lists and dictionaries you can describe basically any data structure used in computing.
Outside of Python, dictionaries are often called hash tables, hash maps or just maps.
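To illustrate how lists and dictionaries combine into larger structures, here is a small made-up record (the field names are invented for this sketch):

```python
# A made-up "person" record: a dictionary whose values are
# a string and a list.
person = {
    "name": "Ada",
    "foods": ["banana", "bread"],
}

print(person["name"])      # 'Ada'
print(person["foods"][0])  # 'banana' - index into the nested list
```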
Get tweets from Twitter with your browser.
Open this URL in your browser: http://search.twitter.com/search.json?q=python&rpp=1
You will see a chunk of data which is not very readable, but it actually contains one tweet related to the keyword python. You can copy and paste the data into a tool like jsonlint.com to make it more readable.
Queries to Twitter's search API consist of a base part and parameters, most of which are optional. The base part http://search.twitter.com/search.json? is always the same. It is followed by the parameters in the form of key=value pairs, separated by the & character.
The example query from this exercise consists of the base part http://search.twitter.com/search.json?, the required parameter q (query) with value python, and the optional parameter rpp (results per page) with value 1.
Exercise: Change the rpp parameter to different values and see how the amount of data varies.
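The query string's key=value structure can also be assembled in Python. A minimal sketch, using plain string formatting (the parameter values are arbitrary):

```python
base = "http://search.twitter.com/search.json?"
params = {"q": "python", "rpp": 5}

# Join each key=value pair with "&"; sorting the keys keeps the
# result predictable.
query = "&".join("{0}={1}".format(key, params[key]) for key in sorted(params))
url = base + query
print(url)  # http://search.twitter.com/search.json?q=python&rpp=5
```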
Get the raw data from Twitter into Python.
Type the following code into the Python interpreter.
>>> import urllib2
>>> response = urllib2.urlopen('http://search.twitter.com/search.json?q=python&rpp=1')
>>> raw_data = response.read()
>>> print(raw_data)
First, we import the module urllib2 from the Python Standard Library, a handy tool for fetching data from the World Wide Web. Then we open the query URL and store the response in the variable response. Because the response is actually a file-like object, we use its read() method to access the data and store it in the variable raw_data. Finally we print the data, which will look very similar to what you have seen in the browser.
Convert the text-based data into an accessible data structure.
>>> import json
>>> data = json.loads(raw_data)
>>> print(data.keys())
>>> print(data['query'])
>>> tweets = data['results']
>>> print(len(tweets))
>>> first_tweet = tweets[0]
>>> print(first_tweet.keys())
>>> print(first_tweet['text'])
Twitter uses the JSON notation to format the response. Fortunately the Python standard library contains a JSON parser which does all the work for us. After importing it, we can use json.loads() to convert a JSON string into a data structure consisting of lists and dictionaries.
Examine the keys of the dictionary. The key query contains the query string we sent to Twitter: python. But much more interesting is results. It contains a list of tweets which matched our query. Its length should be equal to the value of the rpp parameter in the query.
Each tweet is again a dictionary containing various information about it and of course the message itself.
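You can try json.loads() without talking to Twitter at all by feeding it a small hand-written JSON string with the same overall shape as the response (the content here is invented):

```python
import json

# A tiny, made-up JSON string shaped like Twitter's response.
raw_data = '{"query": "python", "results": [{"from_user": "ada", "text": "I love #python"}]}'

data = json.loads(raw_data)
print(data["query"])               # 'python'
print(len(data["results"]))        # 1
print(data["results"][0]["text"])  # 'I love #python'
```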
Exercise: Print the username (from_user) from each tweet followed by the message with a for loop.
Create a function which calls the API and returns the tweets.
import json
import urllib2

def fetch_tweets(query, rpp=100):
    url = 'http://search.twitter.com/search.json?q={0}&rpp={1}'.format(query, rpp)
    response = urllib2.urlopen(url)
    raw_data = response.read()
    data = json.loads(raw_data)
    return data['results']
>>> import twitter
>>> tweets = twitter.fetch_tweets('python', 100)
You don't want to type these commands into the interpreter every time, so let's create a function for this. Open a new file twitter.py and paste the first code block into it.
The function has two arguments: query and rpp. Note the default value of rpp - it's used when the argument is omitted.
To construct the query URL, we use Python's string formatting. {0} is replaced by the format() method's first argument, {1} by the second.
Now you can import your twitter module and get the tweets with a single function call.
Exercise: Use urllib.urlencode() to construct the query URL.
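One possible sketch of the urlencode() approach; note that the import location differs between Python 2 (urllib) and Python 3 (urllib.parse), and the parameter values here are just examples:

```python
try:
    from urllib import urlencode        # Python 2
except ImportError:
    from urllib.parse import urlencode  # Python 3

# A list of (key, value) pairs keeps the parameter order fixed,
# and urlencode takes care of escaping special characters.
params = [("q", "python"), ("rpp", 100)]
url = "http://search.twitter.com/search.json?" + urlencode(params)
print(url)  # http://search.twitter.com/search.json?q=python&rpp=100
```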
Now let's start to play with the data we got. You might have already noticed that people share links in their tweets. But how often? To find this out, we need a new operator: in. You already know ==, < and > from the first session. They return True or False depending on whether the condition is matched. The in operator returns True if something is contained in a list or a string, which is why it's also called a containment operator.
>>> shopping_list = ["bread", "milk", "butter"]
>>> "milk" in shopping_list
True
>>> "fun" in "Python is fun"
True
Knowing that, we can loop over the tweets and count how many contain a link by checking whether the text contains http:
>>> num_links = 0
>>> for tweet in tweets:
... if "http" in tweet["text"]:
... num_links += 1
There might be other interesting words to count. Play around with this and maybe write a function that accepts a list of tweets and a string as arguments and returns how often this string occurs.
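One way such a counting function could look; the sample tweets below are made up for the sketch:

```python
def count_tweets_containing(tweets, word):
    """Return how many tweets contain the given string in their text."""
    count = 0
    for tweet in tweets:
        if word in tweet["text"]:
            count += 1
    return count

# Made-up sample data for trying the function out.
sample_tweets = [
    {"text": "Learning #python is fun http://example.com"},
    {"text": "Just had lunch"},
    {"text": "Check this out: http://example.org"},
]
print(count_tweets_containing(sample_tweets, "http"))  # 2
```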
Python has some functions that help with handling strings. You can find a complete list in the documentation, but for now the following will be sufficient:
>>> s = "Python is fun"
>>> s.startswith("Py") # Return True if the string starts with the given string
True
>>> s.lower() # Converts all letters to lower case
'python is fun'
>>> s.count("n") # Count the number of occurrences of the given string
2
>>> s.split(" ") # Return a list of substrings using the given delimiter
['Python', 'is', 'fun']
What could we do with these? A very important aspect of Twitter are hashtags. They are words prefixed with a # sign and are used to group tweets by topic. Let's find all hashtags!
>>> all_words = []
>>> for tweet in tweets:
... words = tweet["text"].split(" ")
... for word in words:
... all_words.append(word)
...
>>> hashtags = []
>>> for word in all_words:
... if word.startswith("#"):
... hashtags.append(word)
...
Surely some hashtags are used more often than others. What about some statistics?
>>> hashtag_stats = {}
>>> for hashtag in hashtags:
... if hashtag in hashtag_stats:
... old_value = hashtag_stats[hashtag]
... hashtag_stats[hashtag] = old_value + 1
... else:
... hashtag_stats[hashtag] = 1
Did you notice that hashtags occur multiple times in our statistics when they are written in lowercase, uppercase or a mix of both? Use the lower() method to make them equal.
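A sketch of the counting loop with the case normalisation built in; the sample hashtags are invented:

```python
hashtags = ["#Python", "#python", "#PYTHON", "#fun"]  # made-up sample

hashtag_stats = {}
for hashtag in hashtags:
    tag = hashtag.lower()  # treat #Python and #python as the same tag
    if tag in hashtag_stats:
        hashtag_stats[tag] = hashtag_stats[tag] + 1
    else:
        hashtag_stats[tag] = 1

print(hashtag_stats)
```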
It's quite common to store data from the web on the local disk. To do this, we need Python's file functions. To open a file for writing, use open("filename.txt", "w"). If the first argument is just a file name, Python uses the current directory (the directory where you invoked the python command). The second argument "w" indicates that you want to write to the file. If you want to read it, use "r" instead. After you have written to or read the file, you have to close it.
>>> f = open("test.txt", "w")
>>> f.write("Just a few words")
>>> f.close()
>>> f = open("test.txt", "r")
>>> text = f.read()
>>> f.close()
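As a side note, Python can close the file for you: if you open it with a with statement, the file is closed automatically when the block ends. A short sketch:

```python
# "with" closes the file automatically, even if an error occurs.
with open("test.txt", "w") as f:
    f.write("Just a few words")

with open("test.txt", "r") as f:
    text = f.read()

print(text)  # Just a few words
```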
We can use this to store the profile image of a tweet's author on our disk. Load the image's URL with urllib2.urlopen(), read the data and write it to a file opened in binary write mode ("wb"), since an image is binary data rather than text.
>>> url = "https://si0.twimg.com/profile_images/2434435324/6dam4pvl7agywxsxwc5c_normal.png"
>>> response = urllib2.urlopen(url)
>>> raw_data = response.read()
>>> image = open("profile_image.png", "wb")
>>> image.write(raw_data)
>>> image.close()