bug: ECB alignment issues with raw ECB files #158

I've been looking through your processed ECB data (thanks for sharing a processed version) and cross-checking it against the original files.

I've noticed what looks like an alignment issue. Here is an entry from your processed data at https://raw.githubusercontent.com/NervanaSystems/nlp-architect/master/datasets/ecb/ecb_all_event_mentions.json

```
{
        "coref_chain": "ACT15731460277214564",
        "doc_id": "1_21ecbplus.xml",
        "is_continuous": true,
        "is_singleton": false,
        "mention_head": "agreed",
        "mention_head_lemma": "agree",
        "mention_head_pos": "VERB",
        "mention_id": "1_21ecbplus.xml_6_15",
        "mention_ner": null,
        "mention_type": "ACT",
        "predicted_coref_chain": null,
        "score": -1.0,
        "sent_id": 6,
        "tokens_number": [
            15
        ],
        "tokens_str": "agreed",
        "topic_id": "1_ecbplus"
    },
```

If I then go back to the raw ECB XML files and look at sentence 6 in file `1_21ecbplus.xml`, the tokens are:
```
<token t_id="106" sentence="6" number="0">Nothing</token>
<token t_id="107" sentence="6" number="1">bad</token>
<token t_id="108" sentence="6" number="2">is</token>
<token t_id="109" sentence="6" number="3">going</token>
<token t_id="110" sentence="6" number="4">to</token>
<token t_id="111" sentence="6" number="5">happen</token>
<token t_id="112" sentence="6" number="6">.</token>
<token t_id="113" sentence="6" number="7">"</token>
``` 

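The mention points to token 15 of sentence 6, but the sentence shown above only has tokens 0 to 7, and none of them is "agreed". The reason I want to check this is that I'd like to pull out the full token list for the sentence and attach it to this payload, which I can't do while the alignment is off.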
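For reference, here is a minimal sketch of the check described above. It assumes the raw file is a flat list of `<token>` elements under a single root (the `<Document>` wrapper below is my assumption, as are the helper names `sentence_tokens` and `check_alignment`), and it only uses the tokens and mention fields quoted in this issue:

```python
import xml.etree.ElementTree as ET

# Tokens quoted above; the <Document> root is an assumed wrapper.
XML_EXCERPT = """<Document>
<token t_id="106" sentence="6" number="0">Nothing</token>
<token t_id="107" sentence="6" number="1">bad</token>
<token t_id="108" sentence="6" number="2">is</token>
<token t_id="109" sentence="6" number="3">going</token>
<token t_id="110" sentence="6" number="4">to</token>
<token t_id="111" sentence="6" number="5">happen</token>
<token t_id="112" sentence="6" number="6">.</token>
<token t_id="113" sentence="6" number="7">"</token>
</Document>"""

# Only the mention fields needed for the lookup.
MENTION = {
    "doc_id": "1_21ecbplus.xml",
    "sent_id": 6,
    "tokens_number": [15],
    "tokens_str": "agreed",
}


def sentence_tokens(root, sent_id):
    """Return {token number: token text} for one sentence (hypothetical helper)."""
    return {
        int(tok.get("number")): tok.text
        for tok in root.iter("token")
        if int(tok.get("sentence")) == sent_id
    }


def check_alignment(root, mention):
    """Return (ok, recovered_text); recovered_text is None if an index is missing."""
    tokens = sentence_tokens(root, mention["sent_id"])
    if any(n not in tokens for n in mention["tokens_number"]):
        return False, None
    recovered = " ".join(tokens[n] for n in mention["tokens_number"])
    return recovered == mention["tokens_str"], recovered


root = ET.fromstring(XML_EXCERPT)
print(check_alignment(root, MENTION))  # (False, None): token 15 is not in sentence 6
```

With real files, `ET.parse(path).getroot()` would replace the inline excerpt. If the sentence or token numbering in the processed data uses a different convention than the raw XML, a check like this should show where the two diverge.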

Is this a bug, or am I looking at this incorrectly?

