README.md: 16 additions & 6 deletions
@@ -57,18 +57,28 @@ If other limitations or errors are found, please open an issue.
 
 ## Training data
 
-Initialized with GPT(774M,https://github.com/openai/gpt-2/blob/master/model_card.md). Trained with the following data:
-- Sejong Corpus
+Initialized with GPT(774M,https://github.com/openai/gpt-2/blob/master/model_card.md).
+
+The following data was used, and is available for redistribution [here](https://static.ksjit.com/datasets):
+
 - Namuwiki database dump, Early 2020
 - KCC(Kookmin University Corpus)
 - Dump of Korean Wikipedia
-- A *PRIVATE* collection of korean novels(with copyright cleared)
-- Game storylines (with authors' approval)
 - NAVER movie reviews
-- Korean news(about 1GB) from a German university
+- Korean news(about 1GB) from Leipzig(a German university)
 - Context data from KorSQUAD questions
+- Parsed CommonCrawl data(WIP)
+
+Please note the completed dataset includes <|endoftext|> tags.
+
+The following data were used, but is unavailable for redistribution:
+
+- Sejong Corpus
 - '모두의 말뭉치' from corpus.korean.go.kr
-- COBRA webcrawl (10GB)
+- A *PRIVATE* collection of korean novels
+- Webcrawl of modern, uploaded text novels('텍본' - If you want to prevent your novel from going in the training set, please contact me and I will blacklist it)
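
The updated README notes that the completed dataset includes `<|endoftext|>` tags. The sketch below shows one way a consumer of the dataset might split it back into individual documents. It assumes the tag acts as a document separator, as in GPT-2 style corpora; the README does not confirm this. The function name `split_documents` and the sample string are hypothetical and exist only for illustration.

```python
from typing import List

# Tag mentioned in the README. Treating it as a document separator
# is an assumption, not something the README states.
EOT = "<|endoftext|>"


def split_documents(text: str, sep: str = EOT) -> List[str]:
    """Split a corpus string on the separator tag, dropping empty pieces."""
    return [doc.strip() for doc in text.split(sep) if doc.strip()]


if __name__ == "__main__":
    # Hypothetical sample, not taken from the actual dataset.
    sample = "first document<|endoftext|>second document\n<|endoftext|>"
    for doc in split_documents(sample):
        print(doc)
```

If the tags should be kept for training instead, the raw text can be passed through unchanged; the split is only useful for inspecting or filtering individual documents.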