crisis-data/codebook.txt at master · rosscg/crisis-data · GitHub

This project stores data collected from Twitter as follows:


Event
Limited to one instance.
Stores the earliest start time and latest end time of the gps and keyword streams.
  Note: The streams may overrun the stored end times due to delays in the queue. [TODO: verify]
The event start and end times are used in the save_user_timelines function.
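
As a minimal sketch of how the stored window could be derived, assuming each stream reports its own start and end time. StreamWindow and event_window are hypothetical names used for illustration, not part of the project:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Tuple


@dataclass(frozen=True)
class StreamWindow:
    """Hypothetical record of when one stream (gps or keyword) ran."""
    start: datetime
    end: datetime


def event_window(windows: Iterable[StreamWindow]) -> Tuple[datetime, datetime]:
    """Return the earliest start and latest end across all streams."""
    windows = list(windows)
    if not windows:
        raise ValueError("at least one stream window is required")
    return min(w.start for w in windows), max(w.end for w in windows)


gps = StreamWindow(datetime(2020, 1, 1, 10, 0), datetime(2020, 1, 1, 18, 0))
keyword = StreamWindow(datetime(2020, 1, 1, 9, 30), datetime(2020, 1, 1, 17, 0))
start, end = event_window([gps, keyword])
```

Here the Event would store the keyword stream's start and the gps stream's end.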


GeoPoint
Stores latitude and longitude.
Associated with Event model. Limited to two instances (i.e. bounding box).
If only one point is provided, the geo stream will create a bounding box around it based on the values of BOUNDING_BOX_WIDTH and BOUNDING_BOX_HEIGHT.
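
The bounding box logic might look roughly like the sketch below. It assumes the single point is the centre of the box and that BOUNDING_BOX_WIDTH and BOUNDING_BOX_HEIGHT are expressed in degrees; neither assumption is confirmed by this codebook, and the values and the bounding_box helper are illustrative only. The result is ordered west, south, east, north, which matches the south-west then north-east corner order used by the Twitter streaming 'locations' parameter.

```python
from typing import List, Tuple

# Assumed values; the real ones are set in the project's configuration.
BOUNDING_BOX_WIDTH = 0.5   # degrees of longitude (assumption)
BOUNDING_BOX_HEIGHT = 0.5  # degrees of latitude (assumption)


def bounding_box(points: List[Tuple[float, float]]) -> List[float]:
    """Build [west, south, east, north] from one or two (lat, lon) points."""
    if len(points) == 2:
        (lat1, lon1), (lat2, lon2) = points
        return [min(lon1, lon2), min(lat1, lat2),
                max(lon1, lon2), max(lat1, lat2)]
    if len(points) == 1:
        lat, lon = points[0]
        half_w = BOUNDING_BOX_WIDTH / 2
        half_h = BOUNDING_BOX_HEIGHT / 2
        # Assumption: the single point is the centre of the box.
        return [lon - half_w, lat - half_h, lon + half_w, lat + half_h]
    raise ValueError("expected one or two GeoPoints")


single = bounding_box([(10.0, 20.0)])
pair = bounding_box([(1.0, 2.0), (3.0, -4.0)])
```

This sketch does not handle boxes that cross the antimeridian or the poles.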


Keyword
Keywords tracked via the Twitter stream. These can be added manually, or added automatically based on their prevalence in streamed Tweets.
Priority 1 runs on a 'low-priority' stream, which saves Tweets when processes are available.
Priority 2 keeps all objects returned by Twitter.
Associated with Event model.


User
Twitter user object, added manually or via the data collection.
Note that id is an AutoField generated by Django; the object ID from Twitter is stored as user_id instead.
user_class represents the relevance of the user to the collection as follows:
  2 = Ego: Tweeted a message identified in the stream.
  1 = A class 0 user which has been identified as a 'non-spam' account and promoted. 'Spam' accounts include celebrities and news outlets, and are identified by a high follower/following/tweet value, as set in config.py.
  0 = Alter: Follows or is followed by an ego user, or is quoted or replied to by an ego user. Unsorted.
  -1 = A class 0 user which has been identified as a 'spam' account.
added_at: time user was first added to database
data_source: the highest source of the user's observed Tweets. [TODO: effectively duplicated, and may be imputed by running through their tweets.data_source values]
old_screen_name: the screen name at the time of collection, stored if the account has changed its screen name since the original collection.
is_deleted / is_deleted_observed: account has been observed as deleted since initial collection.
user_following / user_followers: temporary array of user ids, used to add users as objects after collection.
user_following_update / user_followers_update / user_network_update_observed_at: updated lists of the above values, and the date at which they were observed.
in_degree / out_degree: currently only represent the degrees to ego accounts and are therefore only relevant to alter objects, or egos with relationships with other egos.
centrality measures: various measures calculated on recorded follower/following network, null if not part of principal component.
tweets_per_hour: Tweets created per hour over the recording period.
ratio_original: ratio of original Tweets (vs. replies and quotes) over the recording period. Note: retweets are not recorded, so they are not included in these calculations.
ratio_detected: ratio of Tweets detected by the system against the total over the period (excl. RTs). Note: the system will not detect all eligible Tweets by a user, so use this metric with caution.
ratio_media: ratio of Tweets with attached media objects.
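
The activity fields above could be computed along these lines. This is a sketch under stated assumptions: TweetRecord and user_metrics are hypothetical names, each ratio is taken against all of the user's recorded Tweets in the period, and 'detected' is assumed to mean the Tweet was caught by a stream rather than found later by lookup.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass(frozen=True)
class TweetRecord:
    """Hypothetical flattened view of one Tweet; retweets are never recorded."""
    is_original: bool  # False for replies and quotes
    detected: bool     # Assumption: True if a stream caught it
    has_media: bool


def user_metrics(tweets: List[TweetRecord], start: datetime,
                 end: datetime) -> Dict[str, Optional[float]]:
    """Compute the per-user activity fields over the recording period."""
    names = ("tweets_per_hour", "ratio_original",
             "ratio_detected", "ratio_media")
    hours = (end - start).total_seconds() / 3600
    total = len(tweets)
    if total == 0 or hours <= 0:
        # Assumption: leave the fields null when there is nothing to measure.
        return {name: None for name in names}
    return {
        "tweets_per_hour": total / hours,
        "ratio_original": sum(t.is_original for t in tweets) / total,
        "ratio_detected": sum(t.detected for t in tweets) / total,
        "ratio_media": sum(t.has_media for t in tweets) / total,
    }


tweets = [
    TweetRecord(is_original=True, detected=True, has_media=False),
    TweetRecord(is_original=True, detected=False, has_media=True),
    TweetRecord(is_original=False, detected=True, has_media=False),
    TweetRecord(is_original=False, detected=False, has_media=False),
]
metrics = user_metrics(tweets, datetime(2020, 1, 1, 0), datetime(2020, 1, 1, 2))
```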


Tweet
Tweet object collected via the stream or via lookup.
Author is represented as a User object FK.
Note that id is an AutoField generated by Django; the ID from Twitter is stored as tweet_id instead.
data_source:
  0 = REST API, or quoted or replied-to Tweets.
  1 = Low-priority keyword stream.
  2 = Priority keyword stream.
  3 = GPS stream (coordinates).
  4 = GPS stream with Place object but no coordinates (i.e. less accurate, and place coordinates may exceed the bounding box set by the user). TODO: Currently in testing.
If media is downloaded, the filenames and datatype are stored as strings.
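
The data_source values map naturally onto an integer enumeration. In the sketch below the integer values come from this codebook, while the member names and the impute_user_data_source helper are hypothetical. The helper illustrates how a User's data_source (the highest source among their observed Tweets) could be imputed from their Tweets.

```python
from enum import IntEnum
from typing import Iterable, Optional


class DataSource(IntEnum):
    """Tweet.data_source values; member names are illustrative only."""
    REST_OR_REFERENCED = 0    # REST API, or quoted/replied-to Tweets
    LOW_PRIORITY_KEYWORD = 1  # low-priority keyword stream
    PRIORITY_KEYWORD = 2      # priority keyword stream
    GPS_COORDINATES = 3       # gps stream, exact coordinates
    GPS_PLACE_ONLY = 4        # gps stream, Place object without coordinates


def impute_user_data_source(tweet_sources: Iterable[int]) -> Optional[DataSource]:
    """Return the highest data_source among a user's Tweets, or None if empty."""
    sources = [DataSource(s) for s in tweet_sources]
    return max(sources) if sources else None


highest = impute_user_data_source([0, 2, 1])
nothing = impute_user_data_source([])
```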


DataCodeDimension
DataCodes belong to a 'DataCodeDimension', to allow data to be coded in more than one category (i.e. dimension).
A DataCodeDimension is associated with either 'tweet' or 'user' models via the coding_subject string value.


DataCode
List of codes for classifying Tweets/Users. ManyToManyField with the Tweet/User model through the Coding intermediate model.
Uses a custom id value that is set manually, as the PK doesn't reset when rows are deleted, which causes issues with the button interface.
Up to 10 codes per dimension are currently supported; any more will work but won't be assigned hotkeys in the coding interface.
There is currently a total limit of 100 codes (easily adjusted via the range argument of 'add_data_code' in Views.py).


Coding
Intermediate model between DataCode and Tweet/User which identifies which coder assigned the code. Multiple coders can be used for validation purposes.
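
To illustrate how multiple coders support validation, the sketch below groups codings by subject and lists the subjects on which coders disagree. It uses plain dataclasses rather than the Django models, the field names are hypothetical, and it assumes each coder assigns at most one code per subject within a dimension.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class CodingRow:
    """Hypothetical flattened Coding row: who assigned which code to what."""
    coder: str
    dimension: str   # name of the DataCodeDimension
    subject_id: int  # tweet_id or user_id, depending on the dimension
    code_id: int     # id of the assigned DataCode


def codes_by_subject(
    rows: Iterable[CodingRow],
) -> Dict[Tuple[str, int], Dict[str, int]]:
    """Map (dimension, subject_id) to {coder: code_id}."""
    grouped: Dict[Tuple[str, int], Dict[str, int]] = defaultdict(dict)
    for row in rows:
        grouped[(row.dimension, row.subject_id)][row.coder] = row.code_id
    return dict(grouped)


def disagreements(rows: Iterable[CodingRow]) -> List[Tuple[str, int]]:
    """Return the (dimension, subject_id) pairs where coders differ."""
    return sorted(
        key for key, by_coder in codes_by_subject(rows).items()
        if len(set(by_coder.values())) > 1
    )


rows = [
    CodingRow("alice", "relevance", 1, 3),
    CodingRow("bob", "relevance", 1, 3),
    CodingRow("alice", "relevance", 2, 3),
    CodingRow("bob", "relevance", 2, 5),
]
conflicts = disagreements(rows)
```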


Hashtag
Hashtags from collected Tweets (represented as the FK).


Url
URLs attached to collected Tweets (represented as the FK).


Mention
Mentioned user names (as strings) from collected Tweets (represented as the FK).


Relo
Relationship between two users as foreign keys. Stores the date the relationship was observed, and a possible date when the relationship was observed to have ended (i.e. after the relationship was cancelled). Note that these dates refer to the time of observation, not the time of relationship creation/end.
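
Because both dates are observation times, a relationship can only be treated as active between the two observations. The sketch below shows one way to query that. The RelationshipRecord type, its field names, and the follower-to-followed direction are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class RelationshipRecord:
    """Hypothetical stand-in for a Relo row."""
    source_user_id: int                          # assumed: the follower
    target_user_id: int                          # assumed: the followed user
    observed_at: datetime                        # when the relationship was first seen
    end_observed_at: Optional[datetime] = None   # when it was seen to have ended


def is_active(relo: RelationshipRecord, at: datetime) -> bool:
    """True if the relationship had been observed, and not yet seen ended, at `at`."""
    if at < relo.observed_at:
        return False
    return relo.end_observed_at is None or at < relo.end_observed_at


def in_degree(user_id: int, relos: Iterable[RelationshipRecord],
              at: datetime) -> int:
    """Count observed followers of `user_id` that were active at `at`."""
    return sum(1 for r in relos
               if r.target_user_id == user_id and is_active(r, at))


relos = [
    RelationshipRecord(1, 9, datetime(2020, 1, 1)),
    RelationshipRecord(2, 9, datetime(2020, 1, 1), datetime(2020, 1, 5)),
    RelationshipRecord(9, 3, datetime(2020, 1, 1)),
]
early = in_degree(9, relos, datetime(2020, 1, 3))
late = in_degree(9, relos, datetime(2020, 1, 6))
```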


AccessToken
Access tokens used for data collection, stored with their associated screen names.
Each stream currently uses a separate token, so at least three tokens are required.


ConsumerKey
Consumer key and secret used for data collection.


CeleryTask
Stores Celery task object IDs, used to track and terminate running tasks.
CURRENTLY BEING DEPRECATED