vinaybommana/twitter-analytics

Step 1

Collect one lakh (100,000) tweets for a given domain (topic) and duration, both supplied as input.

Example

input

sports

output format:
serial_number screen_name user_id tweet_id retweet_count date tweet
1 lorem ip 12 1 23 xh

We'll be using twitterscraper for this purpose.

%%bash
twitterscraper python --limit 1000 --lang en --output ~/backups/today\'stweets.json
INFO: queries: ['python since:2006-03-21 until:2006-11-12', 'python since:2006-11-12 until:2007-07-06', 'python since:2007-07-06 until:2008-02-27', 'python since:2008-02-27 until:2008-10-20', 'python since:2008-10-20 until:2009-06-13', 'python since:2009-06-13 until:2010-02-04', 'python since:2010-02-04 until:2010-09-29', 'python since:2010-09-29 until:2011-05-23', 'python since:2011-05-23 until:2012-01-14', 'python since:2012-01-14 until:2012-09-06', 'python since:2012-09-06 until:2013-04-30', 'python since:2013-04-30 until:2013-12-22', 'python since:2013-12-22 until:2014-08-15', 'python since:2014-08-15 until:2015-04-09', 'python since:2015-04-09 until:2015-12-01', 'python since:2015-12-01 until:2016-07-24', 'python since:2016-07-24 until:2017-03-17', 'python since:2017-03-17 until:2017-11-08', 'python since:2017-11-08 until:2018-07-02', 'python since:2018-07-02 until:2019-02-24']
INFO: Querying python since:2006-03-21 until:2006-11-12
INFO: Querying python since:2006-11-12 until:2007-07-06
INFO: Querying python since:2007-07-06 until:2008-02-27
INFO: Querying python since:2008-02-27 until:2008-10-20
INFO: Querying python since:2008-10-20 until:2009-06-13
INFO: Querying python since:2009-06-13 until:2010-02-04
INFO: Querying python since:2010-02-04 until:2010-09-29
INFO: Querying python since:2011-05-23 until:2012-01-14
INFO: Querying python since:2010-09-29 until:2011-05-23
INFO: Querying python since:2012-01-14 until:2012-09-06
INFO: Querying python since:2012-09-06 until:2013-04-30
INFO: Querying python since:2013-04-30 until:2013-12-22
INFO: Querying python since:2013-12-22 until:2014-08-15
INFO: Querying python since:2014-08-15 until:2015-04-09
INFO: Querying python since:2015-12-01 until:2016-07-24
INFO: Querying python since:2015-04-09 until:2015-12-01
INFO: Querying python since:2017-11-08 until:2018-07-02
INFO: Querying python since:2017-03-17 until:2017-11-08
INFO: Querying python since:2016-07-24 until:2017-03-17
INFO: Querying python since:2018-07-02 until:2019-02-24
INFO: Got 5 tweets for python%20since%3A2006-03-21%20until%3A2006-11-12.
INFO: Got 5 tweets (5 new).
INFO: Got 60 tweets for python%20since%3A2008-10-20%20until%3A2009-06-13.
INFO: Got 65 tweets (60 new).
INFO: Got 54 tweets for python%20since%3A2015-12-01%20until%3A2016-07-24.
INFO: Got 119 tweets (54 new).
INFO: Got 60 tweets for python%20since%3A2017-03-17%20until%3A2017-11-08.
INFO: Got 179 tweets (60 new).
INFO: Got 60 tweets for python%20since%3A2010-02-04%20until%3A2010-09-29.
INFO: Got 239 tweets (60 new).
INFO: Got 60 tweets for python%20since%3A2013-12-22%20until%3A2014-08-15.
INFO: Got 299 tweets (60 new).
INFO: Got 60 tweets for python%20since%3A2010-09-29%20until%3A2011-05-23.
INFO: Got 359 tweets (60 new).
INFO: Got 60 tweets for python%20since%3A2014-08-15%20until%3A2015-04-09.
INFO: Got 419 tweets (60 new).
INFO: Got 60 tweets for python%20since%3A2006-11-12%20until%3A2007-07-06.
INFO: Got 479 tweets (60 new).
INFO: Got 60 tweets for python%20since%3A2015-04-09%20until%3A2015-12-01.
INFO: Got 539 tweets (60 new).
INFO: Got 60 tweets for python%20since%3A2009-06-13%20until%3A2010-02-04.
INFO: Got 599 tweets (60 new).
INFO: Got 60 tweets for python%20since%3A2007-07-06%20until%3A2008-02-27.
INFO: Got 659 tweets (60 new).
INFO: Got 60 tweets for python%20since%3A2018-07-02%20until%3A2019-02-24.
INFO: Got 719 tweets (60 new).
INFO: Got 60 tweets for python%20since%3A2011-05-23%20until%3A2012-01-14.
INFO: Got 779 tweets (60 new).
INFO: Got 60 tweets for python%20since%3A2016-07-24%20until%3A2017-03-17.
INFO: Got 839 tweets (60 new).
INFO: Got 60 tweets for python%20since%3A2013-04-30%20until%3A2013-12-22.
INFO: Got 899 tweets (60 new).
INFO: Got 60 tweets for python%20since%3A2017-11-08%20until%3A2018-07-02.
INFO: Got 959 tweets (60 new).
INFO: Got 60 tweets for python%20since%3A2012-09-06%20until%3A2013-04-30.
INFO: Got 1019 tweets (60 new).
INFO: Got 60 tweets for python%20since%3A2012-01-14%20until%3A2012-09-06.
INFO: Got 1079 tweets (60 new).
INFO: Got 60 tweets for python%20since%3A2008-02-27%20until%3A2008-10-20.
INFO: Got 1139 tweets (60 new).
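The query list in the log above is consistent with the full range (2006-03-21 to 2019-02-24) being cut into 20 equal sub-ranges, one query per sub-range. The sketch below reproduces those boundaries. It assumes equal-width intervals with fractional days rounded down; `split_date_range` is a hypothetical helper written for illustration, not a twitterscraper function.

```python
from datetime import date, timedelta
from typing import List


def split_date_range(begin: date, end: date, parts: int) -> List[date]:
    """Return parts + 1 boundary dates that cut [begin, end] into equal intervals."""
    total_days = (end - begin).days
    # Integer floor division keeps every boundary on a whole day.
    return [begin + timedelta(days=(i * total_days) // parts) for i in range(parts + 1)]


# Range and interval count are read off the log above.
boundaries = split_date_range(date(2006, 3, 21), date(2019, 2, 24), 20)

# One search string per consecutive pair of boundaries.
queries = [
    f"python since:{start} until:{stop}"
    for start, stop in zip(boundaries, boundaries[1:])
]
print(queries[0])
```

Each sub-range returned at most 60 tweets in the run above, so the total (1139) can overshoot `--limit 1000`.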
import json
import pandas as pd
from typing import Any

def load_json_file(file_path: str) -> Any:
    # json.load() no longer accepts an `encoding` argument (removed in Python 3.9),
    # so the encoding is set when opening the file instead
    with open(file_path, "r", encoding="utf-8") as f:
        return json.load(f)
    
tweets = load_json_file("/home/vinay/backups/today\'stweets.json")

list_tweets = [list(elem.values()) for elem in tweets]
list_columns = list(tweets[0].keys())

twitter_data = pd.DataFrame(list_tweets, columns=list_columns)
twitter_data.head()
timestamp url text user html retweets replies fullname id likes
0 2006-11-08T11:46:29 /larskflem/status/59306 coding python. happy time larskflem <p class="TweetTextSize js-tweet-text tweet-te... 0 0 Lars K. Flem 59306 0
1 2006-11-06T21:20:39 /sergio_101/status/57683 Trying to figure out what phone to get next.. ... sergio_101 <p class="TweetTextSize js-tweet-text tweet-te... 0 0 sergio t. ruiz 57683 0
2 2006-10-23T00:21:20 /thomasknoll/status/46836 Learning python while kim watches city of god thomasknoll <p class="TweetTextSize js-tweet-text tweet-te... 0 0 Thomas Knoll 46836 0
3 2006-08-02T02:07:24 /marceloeduardo/status/15613 Finishing some turbogears experience, writing ... marceloeduardo <p class="TweetTextSize js-tweet-text tweet-te... 1 0 Marcelo Eduardo 15613 1
4 2006-07-16T18:03:45 /nitin/status/10584 Heading to peets in emryvil to hack python tnx... nitin <p class="TweetTextSize js-tweet-text tweet-te... 1 1 Nitin Borwankar 10584 1

We can drop the html, url, likes, and replies columns.

We need to reformat the timestamp column, keep the user and fullname columns, and look up the user_id of each user.

Finally, order the columns to match the given output format.

# making timestamp YYYY-MM-DD
twitter_data['timestamp'] = twitter_data['timestamp'].apply(lambda x: x.split('T')[0])

# dropping html, url, likes and replies
twitter_data.drop(columns=['html', 'url', 'likes', 'replies'], inplace=True)

# twitter_data.head()
twitter_data.columns
Index(['timestamp', 'text', 'user', 'retweets', 'fullname', 'id'], dtype='object')
# renaming column names
twitter_data.columns = ['Date', 'Tweet', 'user', 'retweets', 'fullname', 'Tweet_id']

twitter_data.head()
Date Tweet user retweets fullname Tweet_id
0 2006-11-08 coding python. happy time larskflem 0 Lars K. Flem 59306
1 2006-11-06 Trying to figure out what phone to get next.. ... sergio_101 0 sergio t. ruiz 57683
2 2006-10-23 Learning python while kim watches city of god thomasknoll 0 Thomas Knoll 46836
3 2006-08-02 Finishing some turbogears experience, writing ... marceloeduardo 1 Marcelo Eduardo 15613
4 2006-07-16 Heading to peets in emryvil to hack python tnx... nitin 1 Nitin Borwankar 10584
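The notebook renames the columns but does not show the final reordering into the requested output format. A minimal sketch of that step follows, using two rows copied from the table above. `to_output_format` is a hypothetical helper, and `user_id` is left empty here because it is only looked up later.

```python
import pandas as pd

# Two rows copied from the table above; the real frame is `twitter_data`.
sample = pd.DataFrame({
    "Date": ["2006-11-08", "2006-08-02"],
    "Tweet": ["coding python. happy time", "Finishing some turbogears experience, writing ..."],
    "user": ["larskflem", "marceloeduardo"],
    "retweets": ["0", "1"],
    "fullname": ["Lars K. Flem", "Marcelo Eduardo"],
    "Tweet_id": ["59306", "15613"],
})

OUTPUT_COLUMNS = [
    "serial_number", "screen_name", "user_id",
    "tweet_id", "retweet_count", "date", "tweet",
]


def to_output_format(df: pd.DataFrame) -> pd.DataFrame:
    """Rename and reorder columns to match the Step 1 output format."""
    out = df.rename(columns={
        "user": "screen_name",
        "Tweet_id": "tweet_id",
        "retweets": "retweet_count",
        "Date": "date",
        "Tweet": "tweet",
    }).reset_index(drop=True)
    # Serial numbers start at 1, as in the example row of the spec.
    out["serial_number"] = list(range(1, len(out) + 1))
    # user_id is not known yet at this stage, so it stays missing.
    if "user_id" not in out.columns:
        out["user_id"] = pd.NA
    return out[OUTPUT_COLUMNS]


formatted = to_output_format(sample)
print(formatted.columns.tolist())
```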

Step 2

From the Step 1 output, look at the number of retweets (5th column of the table) for each tweet. If a tweet has 0 retweets, discard it; otherwise print the tweet in the above format.

Output: print only the tweets that were retweeted and discard the tweets with no retweets.

The resulting DataFrame contains only the tweets with more than zero retweets.

twitter_data = twitter_data[twitter_data.retweets != "0"]
twitter_data.head()
Date Tweet user retweets fullname Tweet_id
3 2006-08-02 Finishing some turbogears experience, writing ... marceloeduardo 1 Marcelo Eduardo 15613
4 2006-07-16 Heading to peets in emryvil to hack python tnx... nitin 1 Nitin Borwankar 10584
66 2016-07-23 tethne 0.8.1.dev8: Bibliographic network and c... mastercodeonlin 1 MasterCode.Online 757001950088957952
71 2016-07-23 Thank @mandarlimaye 4 your follow and welcom #... lennincaro 3 Lennin Caro 757000314071478272
72 2016-07-23 Thank @h1ng 4 your follow and welcom #PostgreS... lennincaro 4 Lennin Caro 757000213622030336

Step 3

Find the number of distinct users who posted the tweets from Step 2, since one user may have posted multiple tweets.

Input: output of Step 2

Output:

serial_number user_name @mention user_id tweets (number of tweets posted by the user)
# for step 3 the Date and Tweet columns are irrelevant
# keep an independent copy first; a plain assignment would only alias the same DataFrame
twitter_data_with_date = twitter_data.copy()
twitter_data = twitter_data.drop(columns=['Date', 'Tweet'])
twitter_data.head()
user retweets fullname Tweet_id
3 marceloeduardo 1 Marcelo Eduardo 15613
4 nitin 1 Nitin Borwankar 10584
66 mastercodeonlin 1 MasterCode.Online 757001950088957952
71 lennincaro 3 Lennin Caro 757000314071478272
72 lennincaro 4 Lennin Caro 757000213622030336
# rather than dropping duplicates we can `groupby` in pandas
# twitter_data.duplicated(subset='user', keep='first').sum()
# size() on a plain groupby returns a Series indexed by user name
tweet_count = twitter_data.groupby('user').size()
tweet_count['mastercodeonlin']
2
def get_tweet_count(user: str) -> int:
    return tweet_count[user]

get_tweet_count('mastercodeonlin')
2
twitter_data['no_of_tweets'] = twitter_data['user'].apply(lambda x: get_tweet_count(x))

twitter_data_without_tweet_count = twitter_data.drop_duplicates(subset='user', keep="first")
twitter_data_without_tweet_count.reset_index(drop=True, inplace=True)
twitter_data_without_tweet_count.head()
user retweets fullname Tweet_id no_of_tweets
0 marceloeduardo 1 Marcelo Eduardo 15613 1
1 nitin 1 Nitin Borwankar 10584 1
2 mastercodeonlin 1 MasterCode.Online 757001950088957952 2
3 lennincaro 3 Lennin Caro 757000314071478272 5
4 devbattles 9 Dev Battles 756996796786900993 2
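The per-user count can also be produced directly in the Step 3 output layout. This is a minimal sketch on toy rows, not the notebook's own method: `summarise_users` is a hypothetical helper, and `user_id` is omitted because it needs the API lookup that follows.

```python
import pandas as pd

# Toy rows with repeated users; the real input is the Step 2 output.
sample = pd.DataFrame({
    "user": ["marceloeduardo", "nitin", "mastercodeonlin",
             "lennincaro", "lennincaro", "mastercodeonlin"],
    "fullname": ["Marcelo Eduardo", "Nitin Borwankar", "MasterCode.Online",
                 "Lennin Caro", "Lennin Caro", "MasterCode.Online"],
})


def summarise_users(df: pd.DataFrame) -> pd.DataFrame:
    """One row per user with the number of tweets they posted."""
    summary = (
        df.groupby(["user", "fullname"], sort=False)  # keep first-seen order
        .size()
        .reset_index(name="tweets")
    )
    summary = summary.rename(columns={"fullname": "user_name", "user": "@mention"})
    summary["serial_number"] = list(range(1, len(summary) + 1))
    return summary[["serial_number", "user_name", "@mention", "tweets"]]


user_summary = summarise_users(sample)
print(user_summary)
```

Grouping on both `user` and `fullname` assumes a screen name always maps to the same display name within the data set.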
# in order to get user_id for a user
# we need to use tweepy, need to work on getting user_ids twitterscraper way.

import sys
import tweepy

configs = load_json_file("configs.json")

APP_KEY = configs['APP_KEY']
APP_SECRET = configs['APP_SECRET']

# authenticate api
# note: AppAuthHandler and wait_on_rate_limit_notify are tweepy 3.x names
auth = tweepy.AppAuthHandler(APP_KEY, APP_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

if not api:
    print("Can't Authenticate")
    sys.exit(-1)
from typing import Optional

# get user_id from screen name; returns None when the lookup fails
def get_user_id_from_screen_name(screen_name: str, api: tweepy.API) -> Optional[int]:
    try:
        return api.get_user(screen_name=screen_name).id
    except tweepy.TweepError:  # tweepy 3.x exception class
        return None

get_user_id_from_screen_name("nitin", api)
988
# failed lookups return None, so the result is not wrapped in int() (int(None) raises TypeError)
twitter_data_without_tweet_count['user_id'] = twitter_data_without_tweet_count['user'].apply(lambda x: get_user_id_from_screen_name(x, api))
twitter_data_without_tweet_count.head()
timestamp url text user html retweets replies fullname id likes user_id
0 2006-11-08T11:46:29 /larskflem/status/59306 coding python. happy time larskflem <p class="TweetTextSize js-tweet-text tweet-te... 0 0 Lars K. Flem 59306 0 11721.0
1 2006-11-06T21:20:39 /sergio_101/status/57683 Trying to figure out what phone to get next.. ... sergio_101 <p class="TweetTextSize js-tweet-text tweet-te... 0 0 sergio t. ruiz 57683 0 2676.0
2 2006-10-23T00:21:20 /thomasknoll/status/46836 Learning python while kim watches city of god thomasknoll <p class="TweetTextSize js-tweet-text tweet-te... 0 0 Thomas Knoll 46836 0 2874.0
3 2006-08-02T02:07:24 /marceloeduardo/status/15613 Finishing some turbogears experience, writing ... marceloeduardo <p class="TweetTextSize js-tweet-text tweet-te... 1 0 Marcelo Eduardo 15613 1 3652.0
4 2006-07-16T18:03:45 /nitin/status/10584 Heading to peets in emryvil to hack python tnx... nitin <p class="TweetTextSize js-tweet-text tweet-te... 1 1 Nitin Borwankar 10584 1 988.0
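Because the same screen name can appear many times and every lookup costs an API call, it may help to resolve each name once and reuse the result. The sketch below shows one way to do that, with failed lookups recorded as `None`. `resolve_user_ids` and `fake_lookup` are hypothetical; the fake stands in for the tweepy call so the example runs offline, and its two IDs are taken from the tables above.

```python
from typing import Callable, Dict, Iterable, Optional


def resolve_user_ids(
    screen_names: Iterable[str],
    lookup: Callable[[str], Optional[int]],
) -> Dict[str, Optional[int]]:
    """Look up each distinct screen name once; failures map to None."""
    resolved: Dict[str, Optional[int]] = {}
    for name in screen_names:
        if name in resolved:
            continue  # already looked up, no second call
        try:
            resolved[name] = lookup(name)
        except Exception:
            # Suspended or deleted accounts end up here.
            resolved[name] = None
    return resolved


calls = []


def fake_lookup(name: str) -> int:
    """Offline stand-in for the API call; raises KeyError for unknown names."""
    calls.append(name)
    return {"nitin": 988, "marceloeduardo": 3652}[name]


ids = resolve_user_ids(["nitin", "marceloeduardo", "nitin", "missing_user"], fake_lookup)
print(ids)
```

With the real client, `lookup` could be `lambda name: get_user_id_from_screen_name(name, api)`, and the resulting dictionary can be applied with `twitter_data['user'].map(ids)`.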

About

twitter scrapers for analysis
