@@ -1,46 +1,20 @@
-The PROIEL Treebank
-===================
+## The PROIEL Treebank
 
 The _PROIEL Treebank_ is a dependency treebank with morphosyntactic and
-information-structure annotation. It includes texts in several ancient
-Indo-European languages and is freely available under a [Creative Commons
-Attribution-NonCommercial-ShareAlike 3.0 License](
-http://creativecommons.org/licenses/by-nc-sa/3.0/us/).
+information-structure annotation.
+
+It includes texts in several ancient Indo-European languages and is freely
+available under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0
+License](https://creativecommons.org/licenses/by-nc-sa/4.0/).
 
 Please cite as
 
 > Dag T. T. Haug and Marius L. Jøhndal. 2008. 'Creating a Parallel Treebank of the Old Indo-European Bible Translations'. In Caroline Sporleder and Kiril Ribarov (eds.). Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2008) (2008), pp. 27-34.
 
-Releases of the PROIEL Treebank are hosted on
-[Github](https://github.com/proiel/proiel-treebank).
-
-Contents
---------
-
-The following texts are included in this release of the treebank:
-
-Text | Language | Filename | Size
-----------------------------------------------------|---------------------|-------------|---------------
-The Greek New Testament (ed. Tischendorf 1869) | Ancient Greek | greek-nt | 140,763 tokens
-The Armenian New Testament (ed. Künzle 1984) | Classical Armenian | armenian-nt | 23,513 tokens
-The Gothic Bible (ed. Streitberg 1919) | Gothic | gothic-nt | 57,211 tokens
-Codex Marianus (ed. Jagić 1883) | Old Church Slavonic | marianus | 58,269 tokens
-Jerome's Vulgate | Latin | latin-nt | 112,454 tokens
-Caesar, Commentarii belli Gallici (ed. Holmes 1914) | Latin | caes-gal | 28,607 tokens
-Cicero, De officiis (ed. Miller 1913) | Latin | cic-off | 10,644 tokens
-Cicero, Epistulae ad Atticum (ed. Purser 1901) | Latin | cic-att | 42,855 tokens
-Palladius, Opus agriculturae (ed. Schmitt 1898) | Latin | pal-agr | 12,148 tokens
-Peregrinatio Aetheriae (ed. Heraeus 1908) | Latin | per-aeth | 18,356 tokens
-Herodotus, Histories (ed. Godley 1920) | Ancient Greek | hdt | 85,080 tokens
-Sphrantzes, Chronicles (post-1453) (ed. Grecu 1966) | Ancient Greek | chron | 24,612 tokens
-
-(The 'size' column in the table above shows the number of annotated tokens in
-a text. The number of tokens will be slightly larger than the number of words
-in the original printed edition as some words have been split into multiple
-tokens and some tokens have been inserted during annotation.)
-
 Please see the XML files for detailed metadata and a full list of contributors.
 
+### Completeness
+
 Some sentences have not yet been annotated. This is an overview of where in the
 texts unannotated sentences occur:
 
@@ -64,17 +38,27 @@ Sections or section ranges in which there are gaps:
 * `marianus`: MATT 5, MARK 16, LUKE 2, LUKE 24, JOHN 1-2, JOHN 18, JOHN 20
 * `pal-agr`: 1.4-1.12, 1.35-1.40, 2.3, 2.9-2.23, 3.9-3.10
 
-These gaps will be completed in future releases.
+These gaps may be closed in future releases.
 
-Data formats
--------------
+### Contents
 
-The texts are available on two formats:
-
-1. PROIEL XML: These files are the authoritative source files and the only ones
-that contain all available annotation. They contain the complete morphological,
-syntactic and information-structure annotation, as well as the complete text,
-including punctuation, section headers etc. The schema is defined in
-[`proiel.xsd`](https://github.com/proiel/proiel-treebank/blob/master/proiel.xsd).
+The following texts are included in this release of the treebank:
 
-2. [CoNLL-X format](http://nextens.uvt.nl/depparse-wiki/DataFormat)
+(The _size_ column in the table below shows the number of annotated tokens in a
+text. The number of tokens will be slightly larger than the number of words in
+the original printed edition as some words have been split into multiple tokens
+and some tokens have been inserted during annotation.)
+Text | Language | Filename | Size
+----------------------------------------------------|---------------------|-------------|---------------
+The Armenian New Testament (ed. Künzle 1984) | Classical Armenian | armenian-nt | 23,513 tokens
+Commentarii belli Gallici (ed. Holmes 1914) | Latin | caes-gal | 28,657 tokens
+Chronicles (post-1453) (ed. Grecu 1966) | Ancient Greek | chron | 24,612 tokens
+Epistulae ad Atticum (ed. Purser 1901) | Latin | cic-att | 47,528 tokens
+De officiis (ed. Miller 1913) | Latin | cic-off | 11,995 tokens
+The Gothic Bible (ed. Streitberg 1919) | Gothic | gothic-nt | 57,212 tokens
+The Greek New Testament (ed. Tischendorf 1869) | Ancient Greek | greek-nt | 140,773 tokens
+Histories (ed. Godley 1920) | Ancient Greek | hdt | 85,166 tokens
+Jerome's Vulgate | Latin | latin-nt | 112,454 tokens
+Codex Marianus (ed. Jagić 1883) | Church Slavic | marianus | 64,138 tokens
+Opus agriculturae (ed. Schmitt 1898) | Latin | pal-agr | 12,148 tokens
+Peregrinatio Aetheriae (ed. Heraeus 1908) | Latin | per-aeth | 18,356 tokens
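
The Contents table on the added side of this diff is regular enough to be read programmatically, which can help when checking the size column against a release. The following is only a minimal sketch, not part of the PROIEL tooling: the names `TABLE`, `parse_sizes` and `tokens_per_language` are hypothetical, and the table text is copied from the added lines above with the diff markers removed.

```python
import re
from collections import defaultdict

# Table rows copied from the added ("+") side of the diff, markers stripped.
TABLE = """\
Text | Language | Filename | Size
-----|----------|----------|-----
The Armenian New Testament (ed. Künzle 1984) | Classical Armenian | armenian-nt | 23,513 tokens
Commentarii belli Gallici (ed. Holmes 1914) | Latin | caes-gal | 28,657 tokens
Chronicles (post-1453) (ed. Grecu 1966) | Ancient Greek | chron | 24,612 tokens
Epistulae ad Atticum (ed. Purser 1901) | Latin | cic-att | 47,528 tokens
De officiis (ed. Miller 1913) | Latin | cic-off | 11,995 tokens
The Gothic Bible (ed. Streitberg 1919) | Gothic | gothic-nt | 57,212 tokens
The Greek New Testament (ed. Tischendorf 1869) | Ancient Greek | greek-nt | 140,773 tokens
Histories (ed. Godley 1920) | Ancient Greek | hdt | 85,166 tokens
Jerome's Vulgate | Latin | latin-nt | 112,454 tokens
Codex Marianus (ed. Jagić 1883) | Church Slavic | marianus | 64,138 tokens
Opus agriculturae (ed. Schmitt 1898) | Latin | pal-agr | 12,148 tokens
Peregrinatio Aetheriae (ed. Heraeus 1908) | Latin | per-aeth | 18,356 tokens
"""


def parse_sizes(table: str) -> dict[str, tuple[str, int]]:
    """Map each filename to (language, annotated token count)."""
    sizes = {}
    for line in table.splitlines()[2:]:  # skip the header and separator rows
        cells = [cell.strip() for cell in line.split("|")]
        if len(cells) != 4:
            continue  # ignore anything that is not a four-column row
        _text, language, filename, size = cells
        match = re.fullmatch(r"([\d,]+) tokens", size)
        if match:
            sizes[filename] = (language, int(match.group(1).replace(",", "")))
    return sizes


def tokens_per_language(sizes: dict[str, tuple[str, int]]) -> dict[str, int]:
    """Sum the token counts of all texts that share a language label."""
    totals = defaultdict(int)
    for language, count in sizes.values():
        totals[language] += count
    return dict(totals)


sizes = parse_sizes(TABLE)
totals = tokens_per_language(sizes)
print(sum(count for _, count in sizes.values()))
print(totals)
```

With the twelve rows above, the overall sum is 626,552 annotated tokens, of which 231,138 belong to the six Latin texts. These figures count annotated tokens as defined in the note preceding the table, not words in the printed editions.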