README.md (12 additions & 4 deletions)
@@ -5,39 +5,47 @@ The official implementation of the **NAACL-HLT 2019 oral** paper "[Microblog Has
## Data
Due to copyright restrictions on the TREC 2011 Twitter dataset, we only release the Weibo dataset (in `data/Weibo`). For more details about the Twitter dataset, please contact [Yue Wang](https://yuewang-cuhk.github.io/) or [Jing Li](https://girlgunner.github.io/jingli/).
-### Weibo data format
+### Data format
* The dataset is randomly split into three segments (80% training, 10% validation, 10% testing)
* For each segment (train/valid/test), we have a post, its conversation, and the corresponding hashtags (one line per instance)
* When a post has multiple hashtags, they are separated by a semicolon ";"

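The format described above can be read with a few lines of Python. This is a minimal sketch, not the repository's own loader: it assumes the posts, conversations, and hashtags arrive as three line-aligned sequences (for example, three text files opened in parallel), and the names `Instance` and `read_instances` are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Instance(NamedTuple):
    """One example: a post, its conversation, and its hashtags (hypothetical type)."""
    post: str
    conversation: str
    hashtags: List[str]


def read_instances(
    post_lines: Iterable[str],
    conv_lines: Iterable[str],
    tag_lines: Iterable[str],
) -> List[Instance]:
    """Pair up line-aligned inputs; line i of each input is assumed to be instance i."""
    instances = []
    for post, conv, tag_line in zip(post_lines, conv_lines, tag_lines):
        # Multiple hashtags for one post are separated by a semicolon.
        hashtags = [t.strip() for t in tag_line.strip().split(";") if t.strip()]
        instances.append(Instance(post.strip(), conv.strip(), hashtags))
    return instances


# Tiny in-memory example; real input would be read from the files in data/Weibo.
examples = read_instances(
    ["first post text\n", "second post text\n"],
    ["replies to the first post\n", "replies to the second post\n"],
    ["tag a;tag b\n", "tag c\n"],
)
```

Because the function accepts any iterable of strings, open file handles can be passed directly in place of the lists used here.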
### Data statistics
We first present some statistics of the two datasets, including the number of posts and the average length (i.e., number of tokens) of posts, conversations, and hashtags.
+<center>
+
Datasets | # of posts | Avg len of posts | Avg len of convs | Avg len of tags | # of tags per post
--- | --- | --- | --- | --- | ---
Twitter | 44,793 | 13.27 | 29.94 | 1.69 | 1.14
Weibo | 40,171 | 32.64 | 70.61 | 2.70 | 1.11

+</center>
+
We further analyze the hashtags in detail below, including the number of unique hashtags (tagset size) and the proportion of hashtags appearing in the post (**P**), in the conversation (**C**), and in their union (**P&C**).
+<center>
+
Datasets | Size of Tagset | P | C | P&C
--- | --- | --- | --- | ---
Twitter | 4,188 | 2.72% | 5.58% | 7.69%
Weibo | 5,027 | 8.29% | 6.21% | 12.52%

+</center>
+
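As a rough illustration of how the **P**, **C**, and **P&C** proportions could be computed, the sketch below counts a hashtag as "present" when its tokens occur contiguously in the whitespace-tokenized post or conversation. Both the matching rule and the choice to count per hashtag occurrence (rather than per unique hashtag) are assumptions made for this example, not the procedure confirmed by the source; the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def contains(tokens: Sequence[str], phrase: Sequence[str]) -> bool:
    """Return True if `phrase` occurs as a contiguous run inside `tokens`."""
    n = len(phrase)
    if n == 0:
        return False
    return any(
        list(tokens[i:i + n]) == list(phrase)
        for i in range(len(tokens) - n + 1)
    )


def present_rates(
    instances: Iterable[Tuple[str, str, List[str]]]
) -> Dict[str, float]:
    """Fraction of hashtags found in the post (P), conversation (C), or either (P&C)."""
    total = in_post = in_conv = in_either = 0
    for post, conv, hashtags in instances:
        post_toks, conv_toks = post.split(), conv.split()
        for tag in hashtags:
            tag_toks = tag.split()
            p = contains(post_toks, tag_toks)
            c = contains(conv_toks, tag_toks)
            total += 1
            in_post += p
            in_conv += c
            in_either += p or c
    if total == 0:
        return {"P": 0.0, "C": 0.0, "P&C": 0.0}
    return {
        "P": in_post / total,
        "C": in_conv / total,
        "P&C": in_either / total,
    }


# Toy data (invented for illustration): three hashtags, one in a post, one in a conversation.
rates = present_rates([
    ("a b c", "d e", ["b c"]),
    ("x y", "z w q", ["w q", "k"]),
])
```

On real data, the tokenization would need to match however the dataset was segmented, which matters especially for the Weibo text.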
The distribution of hashtag frequency is depicted below. (The script for drawing this figure is in my [DrawFigureForPaper](https://github.com/yuewang-cuhk/DrawFigureForPaper) repo.)
-From such analysis, we can conclude that these two datasets have a *very low present hashtag rate* (unsuitable for extraction model) and the hashtag space is *large and imbalanced* (unsuitable for classification model).
+From such analysis, we can conclude that these two datasets have a **very low present hashtag rate** (unsuitable for extraction model) and the hashtag space is **large and imbalanced** (unsuitable for classification model).
## Model
-Our model uses a dual encoder to encode the user posts and its replies, followed by a bi-attention to capture their interactions. The extracted feature are further merged and fed into the hashtag decoder. The overall architecture is depicted below:
+Our model uses a dual encoder to encode the user posts and its replies, followed by a bi-attention to capture their interactions. The extracted features are further merged and fed into the hashtag decoder. The overall architecture is depicted below: