Commit e88a04f

Twitter scraper Initial commit

File tree

6 files changed: +321 additions, -0 deletions

README.md

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
# Twitter Scraper

![License: CC BY-ND 4.0](https://img.shields.io/badge/License-CC%20BY--ND%204.0-lightgrey.svg)

A simple Python-based Twitter scraper that can scrape tweets either by username or by a search query (other search parameters are supported as well). The program supports `STOP / RESUME` operation, as it keeps a log of all previous position IDs.

This project was created in order to bypass Twitter's 7-day limit, which prevents fetching tweets more than 7 days old; I needed such data for my research project.

**Please note: This is not an alternative to the official APIs provided by Twitter.**

This project is intended for students, researchers, and all those who abide by Twitter's data terms and conditions.
## Contents

1. scraper.py
2. searchParams.py
3. tweets.py
4. main.py
5. requirements.txt

## Usage

1. `scraper.py` contains all the essential code required to grab tweets and store them in a CSV file
2. `searchParams.py` is a class for initializing search parameters
3. `tweets.py` is a class whose objects represent individual tweets
4. `main.py` - the entry point that calls all of the above code. This file takes multiple arguments and is responsible for initializing all the other files.

**HELP**

`python main.py --help`
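
If you would rather drive the scraper from your own script instead of the CLI, here is a minimal sketch (hypothetical usage; `main.py` performs the same setup via click):

```
# Hypothetical programmatic usage; main.py does the same setup via click.
# Run from the project root so scraper.py and searchParams.py are importable.
import logging

from scraper import parse_json
from searchParams import SearchParams

params = SearchParams()
params.set_search_query('github')
params.set_since_date('2018-06-15')
params.set_until_date('2018-06-20')
params.set_language('en')
params.set_op('test.csv')            # tweets are appended here as ';'-separated rows
params.set_log_file_name('test.log')

# parse_json() resumes from the last 'min_pos - <id>' record in the log file,
# so the logger must write INFO-level records to that same file.
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(name)-12s %(levelname)-8s %(message)s',
                    filename=params.log_file_name,
                    filemode='a+')
params.set_logger(logging)

parse_json(params)
```

Note that `main.py` additionally writes a CSV header row to a fresh output file; the sketch above skips that step.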

## Prerequisites & Installation Instructions

This project is intended to be used with Python `3.x`, but feel free to convert it in order to use it with Python `2.x`.

A **requirements.txt** file is provided with the project, which contains all the essential packages needed to run it.

```
pip install -r requirements.txt
```
## Running the code

Use `main.py` to run the code.

For help: `python main.py --help`

Example:

Search for the **github** keyword between **2018-06-15 and 2018-06-20** and save the results to **test.csv**, with **test.log** as the log file.

```
python main.py --searchquery github --since 2018-06-15 --until 2018-06-20 --op test.csv --log test.log
```

## Output

The output of the scraper is saved to the output file provided in the parameters. By default, the output file is `op.csv`.
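
Since the output is a semicolon-separated CSV, it can be inspected with pandas (already pinned in `requirements.txt`). A small sketch, assuming the default output file:

```
# A small sketch for inspecting the scraped CSV with pandas.
# Column names come from the header row that main.py writes.
import pandas as pd

df = pd.read_csv('op.csv', sep=';')
print(df[['screen_name', 'date_time', 'tweet']].head())
```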
The program also keeps a log of all the previous search positions, and writes it to the log file provided in the params. By default, the log file is `def_log.log`. This file is required in order to resume the scraping operation if it is interrupted.

**Note: If you want to `RESUME` a previous incomplete scrape operation, make sure to provide the same log file as you did in the first run.**
## Feedback & Final Thoughts

Again, this project is intended for educational use. Feel free to use it. You may face a cookies problem, where running the code for the first time works perfectly fine, but every subsequent run fails. In order to fix this, try using a `PROXY`.

The `--proxy` parameter can be used to pass a proxy IP and port.

E.g.: `0.0.0.0:80`

There are lots of free proxy sites out there that you can use.
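
For example (the address below is a placeholder; substitute a working proxy):

```
python main.py --searchquery github --since 2018-06-15 --until 2018-06-20 --proxy 0.0.0.0:80 --op test.csv --log test.log
```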

The code may not be very optimized, so if you find any bugs, please report them; feature requests, pull requests, and feedback are all welcome. If you like this project, please do give it a star.

main.py

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
import logging
import os
import sys

import click

from scraper import parse_json
from searchParams import SearchParams


@click.command()
@click.option('--searchquery', default=None, help='Query to be searched on Twitter')
@click.option('--username', default=None, help='User to search for')
@click.option('--since', default=None, help='Start date in the format yyyy-mm-dd (e.g. 2017-08-25)')
@click.option('--until', default=None, help='End date in the format yyyy-mm-dd (e.g. 2019-01-20)')
@click.option('--language', default='en', help='Tweet language to search for')
@click.option('--maxcount', default=0, help='Max number of tweets you want to grab')
@click.option('--proxy', default=None, help='Proxy ip to use')
@click.option('--op', default='op.csv', help='Output file to save the tweets (default: op.csv)')
@click.option('--log', default='def_log.log', help='Log file name to log search index (default: def_log.log)')
def arg_parser(searchquery, username, since, until, language, maxcount, proxy, op, log):
    """
    Python based Twitter Scraper. \n
    Provide search parameters when running this script. \n
    Example: python main.py --searchquery notpetya --since 2017-06-07 --until 2017-07-15 --op notpetya.csv --log test.log
    """
    search_parameters = SearchParams()
    search_parameters.set_search_query(searchquery)
    search_parameters.set_user_name(username)
    search_parameters.set_since_date(since)
    search_parameters.set_until_date(until)
    search_parameters.set_language(language)
    search_parameters.set_max_retrieval_count(maxcount)
    search_parameters.set_proxy(proxy)
    search_parameters.set_op(op)
    search_parameters.set_log_file_name(log)

    # Log everything to the given file and mirror INFO-level messages to the console
    logging.basicConfig(level=logging.DEBUG,
                        format='%(asctime)s %(name)-12s %(levelname)-8s %(message)s',
                        datefmt='%m-%d %H:%M',
                        filename=search_parameters.log_file_name,
                        filemode='a+')
    console = logging.StreamHandler()
    console.setLevel(logging.INFO)
    formatter = logging.Formatter('%(name)-12s: %(levelname)-8s %(message)s')
    console.setFormatter(formatter)
    logging.getLogger('').addHandler(console)
    search_parameters.set_logger(logging)

    # Write the CSV header once, when the output file is new or empty
    with open(search_parameters.op, 'a+') as f:
        if os.stat(search_parameters.op).st_size == 0:
            f.write('uuid;tweet_id;user_name;screen_name;tweet;date_time;retweet_count;fav_count;link\n')

    parse_json(search_parameters)


if __name__ == "__main__":
    if sys.version_info[0] < 3:
        print('Python 3 not found. Please install Python 3.x and try again')
    else:
        arg_parser(sys.argv[1:])

requirements.txt

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
beautifulsoup4==4.7.1
bs4==0.0.1
certifi==2018.11.29
cfscrape==1.9.5
chardet==3.0.4
Click==7.0
cssselect==1.0.3
idna==2.8
lxml==4.3.0
numpy==1.15.4
pandas==0.23.4
pyquery==1.4.0
python-dateutil==2.7.5
pytz==2018.9
requests==2.21.0
six==1.12.0
soupsieve==1.7
urllib3==1.24.1

scraper.py

Lines changed: 137 additions & 0 deletions
@@ -0,0 +1,137 @@
import datetime
import http.cookiejar
import json
import re
import uuid
import urllib.parse, urllib.request, urllib.error

from pyquery import PyQuery

from tweets import Tweet


def get_tweets(search_params, current_position):
    """
    Build the search query and get the tweets
    :param search_params: SearchParams object
    :param current_position: Min position from which to retrieve the tweets
    :return: twitter json_data
    """
    base_url = "https://twitter.com/i/search/timeline?f=tweets&q={}&src=typd&{}max_position={}"
    query = ''
    query = query + (' ' + search_params.search_query) if search_params.search_query else query
    query = query + (' from:' + search_params.account_name) if search_params.account_name else query
    query = query + (' since:' + search_params.since_date) if search_params.since_date else query
    query = query + (' until:' + search_params.until_date) if search_params.until_date else query
    lang = ('lang=' + search_params.language + '&') if search_params.language else ''

    query = urllib.parse.quote(query)
    base_url = base_url.format(query, lang, current_position)
    print(base_url)

    cookie_jar = http.cookiejar.CookieJar()
    headers = [
        ('Host', "twitter.com"),
        ('User-Agent', "Mozilla/5.0 (Windows NT 6.1; Win64; x64)"),
        ('Accept', "application/json, text/javascript, */*; q=0.01"),
        ('Accept-Language', "en-US;q=0.7,en;q=0.3"),
        ('X-Requested-With', "XMLHttpRequest"),
        ('Referer', base_url),
        ('Connection', "keep-alive")
    ]

    attempts = 0
    response = None
    while attempts < 10:
        try:
            if search_params.proxy:
                print('Using IP {}'.format(search_params.proxy))
                proxy = urllib.request.ProxyHandler({'http': search_params.proxy, 'https': search_params.proxy})
                opener = urllib.request.build_opener(proxy, urllib.request.HTTPCookieProcessor(cookie_jar))
            else:
                opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
            opener.addheaders = headers
            response = opener.open(base_url)
            break
        except Exception:
            attempts += 1
            # NOTE: the retry reuses the same proxy, if one was given
            print('Retrying with different IP !!')

    if response is None:
        # All attempts failed; fail loudly instead of calling .read() on nothing
        raise urllib.error.URLError('Could not reach Twitter after 10 attempts')

    json_res = response.read()
    json_data = json.loads(json_res.decode())
    return json_data


def parse_json(search_params):
    """
    Parse the tweet JSON returned by the search timeline
    :param search_params: SearchParams object
    :return: void
    """
    min_position = get_last_search_position(search_params.log_file_name)
    count = 0
    while True:
        json_res = get_tweets(search_params, min_position)
        if len(json_res['items_html'].strip()) == 0:
            break

        min_position = json_res['min_position']
        search_params.logging.info('min_pos - {}'.format(min_position))
        item = json_res['items_html']
        scraped_tweets = PyQuery(item)
        scraped_tweets.remove('div.withheld-tweet')
        tweets = scraped_tweets('div.js-stream-tweet')

        for tweet_html in tweets:
            print(count)
            tweet_py_query = PyQuery(tweet_html)
            name = tweet_py_query.attr("data-name")
            screen_name = tweet_py_query.attr("data-screen-name")
            tweet_id = tweet_py_query.attr("data-tweet-id")
            tweet_text = re.sub(r"\s+", " ",
                                tweet_py_query("p.js-tweet-text").text().replace('# ', '#').replace('@ ', '@'))
            tweet_date_time = int(tweet_py_query("small.time span.js-short-timestamp").attr("data-time"))
            tweet_date_time = datetime.datetime.fromtimestamp(tweet_date_time)
            retweet_count = int(tweet_py_query("span.ProfileTweet-action--retweet span.ProfileTweet-actionCount").attr(
                "data-tweet-stat-count").replace(",", ""))
            favorites_count = int(
                tweet_py_query("span.ProfileTweet-action--favorite span.ProfileTweet-actionCount").attr(
                    "data-tweet-stat-count").replace(",", ""))
            permalink = 'https://twitter.com' + tweet_py_query.attr("data-permalink-path")

            tweet = Tweet(str(uuid.uuid4()), name, screen_name, tweet_id, tweet_text, tweet_date_time, retweet_count,
                          favorites_count, permalink)
            # Now write to the output file (or save to a DB)
            write_op(search_params.op, tweet)
            count += 1
            # sleep(5)
            if 0 < search_params.max_retrieval_count <= count:
                break
        # Also stop the outer loop once the max retrieval count is reached
        if 0 < search_params.max_retrieval_count <= count:
            break


def write_op(op_file, tweet):
    """
    Write a tweet to the output file
    :param op_file: op_file name
    :param tweet: Tweet object
    :return: void
    """
    with open(op_file, 'a+', encoding='utf-8') as f:
        # uuid, tweet_id, user_name, screen_name, tweet, date_time, retweet_count, fav_count, link
        f.write(
            ('%s;%s;%s;%s;%s;%s;%d;%d;%s\n' % (tweet.uuid, tweet.tweet_id, tweet.name, tweet.screen_name, tweet.tweet,
                                               tweet.date_time.strftime("%Y-%m-%d %H:%M"), tweet.retweet_count,
                                               tweet.favourites_count, tweet.link)))


def get_last_search_position(logger_file):
    """
    Required for resuming the previous search operation
    :param logger_file: Logger file name
    :return: Last position id
    """
    # The log file already exists here: main.py's logging.basicConfig creates it
    # before parse_json() runs.
    with open(logger_file, 'r+') as f:
        lines = f.read().splitlines()
    try:
        last_pos = lines[-1].split(' - ')[1]
    except IndexError:
        last_pos = ''
    return last_pos

searchParams.py

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
class SearchParams:
    def __init__(self):
        self.max_retrieval_count = 0
        self.search_query = None
        self.account_name = None
        self.since_date = None
        self.until_date = None
        self.language = None
        self.proxy = None
        self.op = None
        self.logging = None
        self.log_file_name = None
        # Get log file name from logger: print(logging.root.handlers[0].baseFilename)

    def set_max_retrieval_count(self, max_retrieval_count):
        self.max_retrieval_count = max_retrieval_count

    def set_search_query(self, search_query):
        self.search_query = search_query

    def set_user_name(self, account_name):
        self.account_name = account_name

    def set_since_date(self, since_date):
        self.since_date = since_date

    def set_until_date(self, until_date):
        self.until_date = until_date

    def set_language(self, language):
        self.language = language

    def set_proxy(self, proxy):
        self.proxy = proxy

    def set_op(self, op):
        self.op = op

    def set_log_file_name(self, log_file_name):
        self.log_file_name = log_file_name

    def set_logger(self, logger):
        self.logging = logger

tweets.py

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
class Tweet:
    def __init__(self, uuid, name, screen_name, tweet_id, tweet, date_time, retweet_count, favourites_count, link):
        self.uuid = uuid
        self.name = name
        self.screen_name = screen_name
        self.tweet_id = tweet_id
        self.tweet = tweet
        self.date_time = date_time
        self.retweet_count = retweet_count
        self.favourites_count = favourites_count
        self.link = link
