
Commit 9126ece

Author: Jinho Choi (committed)
Added DDRConvertDemo.
1 parent 2dda92c commit 9126ece

File tree: 6 files changed (+217 −53 lines)


elit-ddr/README.md

Lines changed: 134 additions & 0 deletions
@@ -0,0 +1,134 @@
# DDR Conversion

DDR conversion generates [deep dependency graphs](https://github.com/emorynlp/ddr) from Penn Treebank-style constituency trees.
The conversion tool is written in Java and developed by [Emory NLP](http://nlp.mathcs.emory.edu) as part of the [ELIT](https://elit.cloud) project.

## Installation

Add the following dependency to your Maven project:

```
<dependency>
    <groupId>cloud.elit</groupId>
    <artifactId>elit-ddr</artifactId>
    <version>0.0.4</version>
</dependency>
```
* Download the conversion script: [nlp4j-ddr.jar](http://nlp.mathcs.emory.edu/nlp4j/nlp4j-ddr.jar).
* Make sure [Java 8 or above](http://www.oracle.com/technetwork/java/javase/downloads) is installed on your machine:

```
$ java -version
java version "1.8.x"
Java(TM) SE Runtime Environment (build 1.8.x)
...
```
* Run the following command:

```
java -cp nlp4j-ddr.jar edu.emory.mathcs.nlp.bin.DDGConvert -i <filepath> [-r -n -pe <string> -oe <string>]
```

* `-i`: the path to the parse file or a directory containing the parse files to convert.
* `-r`: if set, recursively process all files with the parse extension in the subdirectories of the input directory.
* `-n`: if set, normalize the parse trees before the conversion.
* `-pe`: the extension of the parse files; required if the input path is a directory (default: `parse`).
* `-oe`: the extension of the output files (default: `ddg`).
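For example (an illustrative invocation; the `my-treebank` directory is hypothetical), the following converts every `*.parse` file under the directory recursively and writes a `.ddg` file next to each input:

```
java -cp nlp4j-ddr.jar edu.emory.mathcs.nlp.bin.DDGConvert -i my-treebank -r -pe parse -oe ddg
```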
## Corpora

DDG conversion has been tested on the following corpora. Some of these corpora require you to be a member of the [Linguistic Data Consortium](https://www.ldc.upenn.edu) (LDC). Retrieve the corpora from LDC and run the following command for each corpus to generate DDG.

* [OntoNotes Release 5.0](https://catalog.ldc.upenn.edu/LDC2013T19):

```
java -cp nlp4j-ddr.jar edu.emory.mathcs.nlp.bin.DDGConvert -r -i ontonotes-release-5.0/data/files/data/english/annotations
```

* [English Web Treebank](https://catalog.ldc.upenn.edu/LDC2012T13):

```
java -cp nlp4j-ddr.jar edu.emory.mathcs.nlp.bin.DDGConvert -r -i eng_web_tbk/data -pe tree
```

* [QuestionBank with Manually Revised Treebank Annotation 1.0](https://catalog.ldc.upenn.edu/LDC2012R121):

```
java -cp nlp4j-ddr.jar edu.emory.mathcs.nlp.bin.DDGConvert -i QB-revised.tree
```
## Merge

We have internally updated these corpora to reduce annotation errors and produce a richer representation. If you want to take advantage of our latest updates, merge the original annotation with ours. You still need to retrieve the original corpora from LDC.

* Clone this repository:

```
git clone https://github.com/emorynlp/ddr.git
```

* Run the following command:

```
java -cp nlp4j-ddr.jar edu.emory.mathcs.nlp.bin.DDGMerge <source path> <target path> <parse ext>
```

* `<source path>`: the path to the original corpus.
* `<target path>`: the path to our annotation.
* `<parse ext>`: the extension of the parse files.

* [OntoNotes Release 5.0](https://catalog.ldc.upenn.edu/LDC2013T19):

```
java -cp nlp4j-ddr.jar edu.emory.mathcs.nlp.bin.DDGMerge ontonotes-release-5.0/data/files/data/english/annotations ddr/english/ontonotes parse
```

* [English Web Treebank](https://catalog.ldc.upenn.edu/LDC2012T13):

```
java -cp nlp4j-ddr.jar edu.emory.mathcs.nlp.bin.DDGMerge eng_web_tbk/data ddr/english/google/ewt tree
```

* [QuestionBank with Manually Revised Treebank Annotation 1.0](https://catalog.ldc.upenn.edu/LDC2012R121):

```
java -cp nlp4j-ddr.jar edu.emory.mathcs.nlp.bin.DDGMerge QB-revised.tree ddr/english/google/qb/QB-revised.tree.skel tree
```

## Format

DDG is represented in the tab-separated values (TSV) format, where each column represents a different field. Semantic roles are indicated in the `feats` column with the key `sem`.

```
1	You	you	PRP	_	3	nsbj	7:nsbj	O
2	can	can	MD	_	3	modal	_	O
3	ascend	ascend	VB	_	0	root	_	O
4	Victoria	victoria	NNP	_	5	com	_	B-LOC
5	Peak	peak	NNP	_	3	obj	_	L-LOC
6	to	to	TO	_	7	aux	_	O
7	get	get	VB	sem=prp	3	advcl	_	O
8	a	a	DT	_	10	det	_	O
9	panoramic	panoramic	JJ	_	10	attr	_	O
10	view	view	NN	_	7	obj	_	O
11	of	of	IN	_	16	case	_	O
12	Victoria	victoria	NNP	_	13	com	_	B-LOC
13	Harbor	harbor	NNP	_	16	poss	_	I-LOC
14	's	's	POS	_	13	case	_	L-LOC
15	beautiful	beautiful	JJ	_	16	attr	_	O
16	scenery	scenery	NN	_	10	ppmod	_	O
17	.	.	.	_	3	p	_	O
```
* `id`: current token ID (starting at 1).
* `form`: word form.
* `lemma`: lemma.
* `pos`: part-of-speech tag.
* `feats`: extra features; different features are delimited by `|`, keys and values are delimited by `=` (`_` indicates no feature).
* `headId`: head token ID.
* `deprel`: dependency label.
* `sheads`: secondary heads (`_` indicates no secondary head).
* `nament`: named entity tags in the `BILOU` notation if the annotation is available.
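
As an illustration only (not part of the committed README or the toolkit), the following sketch reads a file in this TSV format with plain Java and prints each token's head and `sem` role; the class name and input path are hypothetical:

```
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical reader sketch for the DDG TSV format described above.
public class DDGReaderSketch {
    public static void main(String[] args) throws IOException {
        for (String line : Files.readAllLines(Paths.get("sample.ddg"))) { // hypothetical path
            if (line.isEmpty()) {            // a blank line separates sentences
                System.out.println();
                continue;
            }
            String[] f = line.split("\t");   // id, form, lemma, pos, feats, headId, deprel, sheads, nament
            String sem = "_";
            if (!f[4].equals("_")) {         // feats: key=value pairs delimited by '|'
                for (String kv : f[4].split("\\|")) {
                    String[] p = kv.split("=", 2);
                    if (p[0].equals("sem")) sem = p[1];
                }
            }
            System.out.printf("%s\t%s -> head %s (%s), sem=%s%n", f[0], f[1], f[5], f[6], sem);
        }
    }
}
```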

elit-ddr/src/main/java/cloud/elit/ddr/bin/DDGConvert.java renamed to elit-ddr/src/main/java/cloud/elit/ddr/bin/DDRConvert.java

Lines changed: 45 additions & 49 deletions
@@ -15,15 +15,13 @@
  */
 package cloud.elit.ddr.bin;
 
-import cloud.elit.ddr.constituency.CTNode;
 import cloud.elit.ddr.util.*;
 import cloud.elit.ddr.constituency.CTReader;
 import cloud.elit.ddr.constituency.CTTree;
 import cloud.elit.ddr.conversion.C2DConverter;
 import cloud.elit.ddr.conversion.EnglishC2DConverter;
 import cloud.elit.sdk.collection.tuple.ObjectIntIntTuple;
 import cloud.elit.sdk.structure.Chunk;
-import cloud.elit.sdk.structure.Document;
 import cloud.elit.sdk.structure.Sentence;
 import cloud.elit.sdk.structure.node.NLPNode;
 import it.unimi.dsi.fastutil.ints.Int2ObjectMap;
@@ -34,7 +32,7 @@
 import java.util.ArrayList;
 import java.util.List;
 
-public class DDGConvert {
+public class DDRConvert {
     @Option(name = "-d", usage = "input path (required)", required = true, metaVar = "<filepath>")
     private String input_path;
     @Option(name = "-pe", usage = "parse file extension (default: parse)", metaVar = "<string>")
@@ -46,24 +44,49 @@ public class DDGConvert {
     @Option(name = "-r", usage = "if set, traverse parse files recursively", metaVar = "<boolean>")
     private boolean recursive = false;
 
-    public DDGConvert() {
+    public DDRConvert() {
 
     }
 
-    public DDGConvert(String[] args) {
+    public DDRConvert(String[] args) {
         BinUtils.initArgs(args, this);
         Language language = Language.ENGLISH;
 
         List<String> parseFiles = FileUtils.getFileList(input_path, parse_ext, recursive);
         C2DConverter converter = new EnglishC2DConverter();
 
         for (String parseFile : parseFiles) {
-            int n = convert(converter, language, parseFile, parse_ext, output_ext, normalize);
+            int n = convert(converter, language, parseFile, parseFile + "." + output_ext, normalize);
             System.out.printf("%s: %d trees\n", parseFile, n);
         }
     }
 
-    Int2ObjectMap<List<ObjectIntIntTuple<String>>> getNER(String parseFile) {
+    int convert(C2DConverter converter, Language language, String parseFile, String outputFile, boolean normalize) {
+        Int2ObjectMap<List<ObjectIntIntTuple<String>>> ner_map = getNamedEntities(parseFile);
+        CTReader reader = new CTReader(IOUtils.createFileInputStream(parseFile), language);
+        PrintStream fout = IOUtils.createBufferedPrintStream(outputFile);
+        Sentence dTree;
+        CTTree cTree;
+        int n;
+
+        for (n = 0; (cTree = reader.next()) != null; n++) {
+            if (normalize) cTree.normalizeIndices();
+            dTree = converter.toDependencyGraph(cTree);
+
+            if (dTree == null) {
+                System.err.println("No token in the tree " + (n + 1) + "\n" + cTree.toStringLine());
+            } else {
+                processNamedEntities(ner_map, cTree, dTree, n);
+                fout.println(dTree.toTSV() + "\n");
+            }
+        }
+
+        reader.close();
+        fout.close();
+        return n;
+    }
+
+    Int2ObjectMap<List<ObjectIntIntTuple<String>>> getNamedEntities(String parseFile) {
         final String nameFile = parseFile.substring(0, parseFile.length() - 5) + "name";
         Int2ObjectMap<List<ObjectIntIntTuple<String>>> map = new Int2ObjectOpenHashMap<>();
         File file = new File(nameFile);
@@ -85,57 +108,31 @@ Int2ObjectMap<List<ObjectIntIntTuple<String>>> getNER(String parseFile) {
                }
            }
        } catch (Exception e) {
+            map = null;
            e.printStackTrace();
        }

        return map;
    }

-    int convert(C2DConverter converter, Language language, String parseFile, String parseExt, String outputExt, boolean normalize) {
-        CTReader reader = new CTReader(IOUtils.createFileInputStream(parseFile), language);
-        Int2ObjectMap<List<ObjectIntIntTuple<String>>> ner_map = getNER(parseFile);
-        Document doc = new Document();
-        Sentence dTree;
-        CTTree cTree;
-
-        for (int n = 0; (cTree = reader.next()) != null; n++) {
-            for (CTNode nn : cTree.getTokens()) {
-                if (nn.isSyntacticTag("EMO")) nn.setSyntacticTag(PTBLib.P_NFP);
-            }
-
-            if (normalize) cTree.normalizeIndices();
-            dTree = converter.toDependencyGraph(cTree);
-
-            if (dTree == null)
-                System.err.println("No token in the tree " + (n + 1) + "\n" + cTree.toStringLine());
-            else {
-                doc.add(dTree);
-
-                if (ner_map == null)
-                    dTree.setNamedEntities(null);
-                else if (ner_map.containsKey(n)) {
-                    List<Chunk> chunks = new ArrayList<>();
+    void processNamedEntities(Int2ObjectMap<List<ObjectIntIntTuple<String>>> ner_map, CTTree cTree, Sentence dTree, int sen_id) {
+        if (ner_map == null) {
+            dTree.setNamedEntities(null);
+            return;
+        }

-                    for (ObjectIntIntTuple<String> t : ner_map.get(n)) {
-                        List<NLPNode> nodes = new ArrayList<>();
+        List<ObjectIntIntTuple<String>> list = ner_map.get(sen_id);

-                        for (int tok_id = cTree.getTerminal(t.i1).getTokenID(); tok_id < cTree.getTerminal(t.i2).getTokenID() + 1; tok_id++)
-                            nodes.add(dTree.get(tok_id));
+        if (list != null) {
+            for (ObjectIntIntTuple<String> t : list) {
+                List<NLPNode> nodes = new ArrayList<>();

-                        chunks.add(new Chunk(nodes, t.o));
-                    }
+                for (int tok_id = cTree.getTerminal(t.i1).getTokenID(); tok_id < cTree.getTerminal(t.i2).getTokenID() + 1; tok_id++)
+                    nodes.add(dTree.get(tok_id));

-                    dTree.setNamedEntities(chunks);
-                }
+                dTree.addNamedEntity(new Chunk(nodes, t.o));
            }
        }
-
-        reader.close();
-
-        PrintStream fout = IOUtils.createBufferedPrintStream(parseFile + "." + outputExt);
-        fout.println(outputExt.equalsIgnoreCase("tsv") ? doc.toTSV() : doc.toString());
-        fout.close();
-        return doc.size();
    }

@@ -148,16 +145,15 @@ void convertEnglish() {
             boolean norm = dir.equals("bionlp") || dir.equals("bolt");
 
             for (String parseFile : parseFiles) {
-                int n = convert(converter, Language.ENGLISH, parseFile, "parse", "tsv", norm);
+                int n = convert(converter, Language.ENGLISH, parseFile, "tsv", norm);
                 System.out.printf("%s: %d trees\n", parseFile, n);
             }
         }
     }
 
     public static void main(String[] args) {
         try {
-            new DDGConvert(args);
-            // new DDGConvert().convertEnglish();
+            new DDRConvert(args);
         } catch (Exception e) {
             e.printStackTrace();
         }
elit-ddr/src/main/java/cloud/elit/ddr/bin/DDRConvertDemo.java

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
/*
 * Copyright 2018 Emory University
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package cloud.elit.ddr.bin;

import cloud.elit.ddr.constituency.CTTree;
import cloud.elit.ddr.conversion.C2DConverter;
import cloud.elit.ddr.conversion.EnglishC2DConverter;
import cloud.elit.ddr.util.Language;
import cloud.elit.sdk.structure.Document;
import cloud.elit.sdk.structure.Sentence;

public class DDRConvertDemo {
    public static void main(String[] args) {
        final String parseFile = "/Users/jdchoi/workspace/elit-java/relcl.parse";
        final String tsvFile = "/Users/jdchoi/workspace/elit-java/relcl.tsv";
        C2DConverter converter = new EnglishC2DConverter();
        DDRConvert ddr = new DDRConvert();
        ddr.convert(converter, Language.ENGLISH, parseFile, tsvFile, false);
    }
}

elit-ddr/src/main/java/cloud/elit/ddr/conversion/C2DConverter.java

Lines changed: 1 addition & 1 deletion
@@ -219,7 +219,7 @@ private CTNode getTerminalHead(CTNode node) {
     protected Sentence createDependencyGraph(CTTree tree) {
         List<CTNode> tokens = tree.getTokens();
         Sentence graph = new Sentence();
-        String form, pos, lemma, nament;
+        String form, pos, lemma;
         NLPNode node, head;
         int id;
 
elit-sdk/src/main/java/cloud/elit/sdk/structure/Sentence.java

Lines changed: 1 addition & 3 deletions
@@ -15,7 +15,6 @@
  */
 package cloud.elit.sdk.structure;
 
-import cloud.elit.sdk.structure.node.NLPArc;
 import cloud.elit.sdk.structure.node.NLPNode;
 import cloud.elit.sdk.structure.util.ELITUtils;
 import cloud.elit.sdk.structure.util.Fields;
@@ -256,8 +255,7 @@ private void toTSVNamedEntities(List<List<String>> conll) {
         }
 
         for (List<String> c : conll) {
-            if (c.size() < 9)
-                c.add("O");
+            if (c.size() < 9) c.add("O");
         }
     }
 }

elit-sdk/src/main/java/cloud/elit/sdk/structure/node/Node.java

Lines changed: 2 additions & 0 deletions
@@ -152,6 +152,7 @@ public boolean addChild(int index, N node) {
      * @param node the node.
      * @return the previously index'th node if added; otherwise, {@code null}.
      */
+    @SuppressWarnings("UnusedReturnValue")
     public N setChild(int index, N node) {
         if (!isParentOf(node)) {
             if (node.hasParent())
@@ -173,6 +174,7 @@ public N setChild(int index, N node) {
      * @param node the node.
      * @return the removed child if exists; otherwise, {@code null}.
      */
+    @SuppressWarnings("UnusedReturnValue")
     public N removeChild(N node) {
         return removeChild(indexOf(node));
     }
