Commit 9d281db

enlarge figure
1 parent ae6c05a commit 9d281db

1 file changed: +12 -4 lines changed

README.md

Lines changed: 12 additions & 4 deletions
@@ -5,39 +5,47 @@ The official implementation of the **NAACL-HLT 2019 oral** paper "[Microblog Has
 ## Data
 Due to copyright restrictions on the TREC 2011 Twitter dataset, we only release the Weibo dataset (in `data/Weibo`). For more details about the Twitter dataset, please contact [Yue Wang](https://yuewang-cuhk.github.io/) or [Jing Li](https://girlgunner.github.io/jingli/).

-### Weibo data format
+### Data format
 * The dataset is randomly split into three segments (80% training, 10% validation, 10% testing)
 * For each segment (train/valid/test), we provide the post, its conversation, and the corresponding hashtags (one line for each instance)
 * When a post has multiple hashtags, they are separated by a semicolon ";"

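The data format described above can be read with a few lines of Python. This is a minimal sketch, not the repository's own loader: the function name `read_instances` and the sample lines are hypothetical, and it assumes that posts, conversations, and hashtags arrive as three line-aligned sequences (line *i* of each belongs to the same instance).

```python
from typing import Dict, Iterable, List


def read_instances(posts: Iterable[str],
                   convs: Iterable[str],
                   tags: Iterable[str]) -> List[Dict[str, object]]:
    """Pair up line-aligned posts, conversations, and hashtag lines.

    Hypothetical helper; the alignment of the three inputs is an assumption.
    """
    instances = []
    for post, conv, tag_line in zip(posts, convs, tags):
        # Multiple hashtags for one post are separated by a semicolon ";".
        hashtags = [t.strip() for t in tag_line.strip().split(";") if t.strip()]
        instances.append({
            "post": post.strip(),
            "conversation": conv.strip(),
            "hashtags": hashtags,
        })
    return instances


# Made-up example lines (not taken from the dataset).
sample = read_instances(
    ["what a great game tonight\n"],
    ["totally agree , best match this season\n"],
    ["nba finals;game 7\n"],
)
print(sample[0]["hashtags"])
```

Passing open file handles instead of lists works the same way, since the function only iterates over lines.
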
 ### Data statistics
 We first present some statistics of the two datasets, including number of posts and the average length (i.e., token number) of post, conversation, and hashtags.

+<center>
+
 Datasets | # of posts | Avg len of posts | Avg len of convs | Avg len of tags | # of tags per post
 --- | --- | --- | --- | --- | ---
 Twitter | 44,793 | 13.27 | 29.94 | 1.69 | 1.14
 Weibo | 40,171 | 32.64 | 70.61 | 2.70 | 1.11

+</center>
+
 We further analyze the detailed statistics of the hashtags below, including size of all the unique hashtags, the proportion of hashtags appearing in the post (**P**), conversation (**C**), and the union set of them (**P&C**).

+<center>
+
 Datasets | Size of Tagset | P | C | P&C
 --- | --- | --- | --- | ---
 Twitter | 4,188 | 2.72% | 5.58% | 7.69%
 Weibo | 5,027 | 8.29% | 6.21% | 12.52%

+</center>
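One way to compute the **P**, **C**, and **P&C** columns is sketched below. The exact counting rule behind the table is not spelled out here, so this sketch makes two assumptions: a hashtag counts as "present" when its whitespace-separated tokens occur as a contiguous span in the text, and each hashtag occurrence is counted once per instance. The names `contains` and `present_rates` and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple


def contains(tokens: Sequence[str], phrase: Sequence[str]) -> bool:
    """True if `phrase` occurs as a contiguous token span in `tokens`."""
    n = len(phrase)
    return n > 0 and any(
        list(tokens[i:i + n]) == list(phrase)
        for i in range(len(tokens) - n + 1)
    )


def present_rates(
    instances: List[Tuple[str, str, List[str]]]
) -> Tuple[float, float, float]:
    """Return (P, C, P&C): share of hashtags found in post, conversation, either."""
    in_post = in_conv = in_either = total = 0
    for post, conv, hashtags in instances:
        post_tokens, conv_tokens = post.split(), conv.split()
        for tag in hashtags:
            tag_tokens = tag.split()
            p = contains(post_tokens, tag_tokens)
            c = contains(conv_tokens, tag_tokens)
            in_post += p
            in_conv += c
            in_either += p or c  # union of the two sources
            total += 1
    if total == 0:
        return 0.0, 0.0, 0.0
    return in_post / total, in_conv / total, in_either / total


# Toy data, invented for illustration only.
toy = [
    ("the nba finals start tonight", "cannot wait", ["nba finals"]),
    ("what a game", "game 7 was wild", ["game 7"]),
    ("so tired today", "same here", ["monday blues"]),
    ("love this song", "this song is on repeat", ["this song"]),
]
P, C, PC = present_rates(toy)
print(P, C, PC)
```

Because **P&C** is a union, it is at most the sum of **P** and **C**, which matches the table (for Twitter, 7.69% is below 2.72% + 5.58%).
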
+
 The distribution of hashtag frequency is depicted below. (The script for drawing this figure is in my [DrawFigureForPaper](https://github.com/yuewang-cuhk/DrawFigureForPaper) repo.)

 <p align="center">
 <img src="https://github.com/yuewang-cuhk/HashtagGeneration/blob/master/hashtag_distribution.PNG" alt="The overall architecture" width="500"/>
 </p>

-From such analysis, we can conclude that these two datasets have a *very low present hashtag rate* (unsuitable for extraction model) and the hashtag space is *large and imbalanced* (unsuitable for classification model).
+From such analysis, we can conclude that these two datasets have a **very low present hashtag rate** (unsuitable for extraction model) and the hashtag space is **large and imbalanced** (unsuitable for classification model).

 ## Model
-Our model uses a dual encoder to encode the user posts and its replies, followed by a bi-attention to capture their interactions. The extracted feature are further merged and fed into the hashtag decoder. The overall architecture is depicted below:
+Our model uses a dual encoder to encode the user posts and its replies, followed by a bi-attention to capture their interactions. The extracted features are further merged and fed into the hashtag decoder. The overall architecture is depicted below:

 <p align="center">
-<img src="https://github.com/yuewang-cuhk/HashtagGeneration/blob/master/model.png" alt="The overall architecture" width="500"/>
+<img src="https://github.com/yuewang-cuhk/HashtagGeneration/blob/master/model.png" alt="The overall architecture" width="600"/>
 </p>

 ## Code
