Commit 33b6f7c

Update README.md
1 parent f010360 commit 33b6f7c

1 file changed: +16 -6 lines changed

README.md

Lines changed: 16 additions & 6 deletions
@@ -57,18 +57,28 @@ If other limitations or errors are found, please open an issue.
 
 ## Training data
 
-Initialized with GPT(774M,https://github.com/openai/gpt-2/blob/master/model_card.md). Trained with the following data:
-- Sejong Corpus
+Initialized with GPT(774M,https://github.com/openai/gpt-2/blob/master/model_card.md).
+
+The following data was used, and is available for redistribution [here](https://static.ksjit.com/datasets):
+
 - Namuwiki database dump, Early 2020
 - KCC(Kookmin University Corpus)
 - Dump of Korean Wikipedia
-- A *PRIVATE* collection of korean novels(with copyright cleared)
-- Game storylines (with authors' approval)
 - NAVER movie reviews
-- Korean news(about 1GB) from a German university
+- Korean news(about 1GB) from Leipzig(a German university)
 - Context data from KorSQUAD questions
+- Parsed CommonCrawl data(WIP)
+
+Please note the completed dataset includes <|endoftext|> tags.
+
+The following data were used, but is unavailable for redistribution:
+
+- Sejong Corpus
 - '모두의 말뭉치' from corpus.korean.go.kr
-- COBRA webcrawl (10GB)
+- A *PRIVATE* collection of korean novels
+- Webcrawl of modern, uploaded text novels('텍본' - If you want to prevent your novel from going in the training set, please contact me and I will blacklist it)
+- Game storylines (with authors' approval)
+
 
 ## Training procedure
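
The added README text notes that the completed dataset includes <|endoftext|> tags. A minimal sketch of how a consumer might split such a dump back into individual documents, assuming the tag acts as a document separator in plain UTF-8 text (the README does not say how the tags are placed). The function name `split_documents` and the sample string are hypothetical and only for illustration.

```python
EOT = "<|endoftext|>"  # tag named in the README; its role as a separator is an assumption


def split_documents(text: str) -> list[str]:
    """Split a raw text dump on the end-of-text tag, dropping empty pieces."""
    return [part.strip() for part in text.split(EOT) if part.strip()]


# Hypothetical sample standing in for a slice of the dataset.
sample = "first document\n<|endoftext|>\nsecond document\n<|endoftext|>\n"
docs = split_documents(sample)
```

If the tags turn out to be placed differently in the published files, only the `EOT` constant and the splitting rule would need to change.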
