Inconsistencies in dataset annotations #9

@janpf

Hi!
I've just found an inconsistency in the dev split of the Darmstadt dataset. I haven't checked whether it also occurs in other datasets or in other splits.

Two back-to-back examples in the dev split look like this:

{
  "sent_id": "DeVry_University_95_05-16-2004-6",
  "text": "I can't overemphasize that enough .",
  "opinions": [
    {
      "Source": [
        [],
        []
      ],
      "Target": [
        [
          "that"
        ],
        [
          "22:26"
        ]
      ],
      "Polar_expression": [
        [
          "can't overemphasize enough"
        ],
        [
          "2:33"
        ]
      ],
      "Polarity": "Positive",
      "Intensity": "Strong"
    }
  ]
},
{
  "sent_id": "DeVry_University_95_05-16-2004-7",
  "text": "The school gives students a knowledge base that makes them extremely competitive in the corporate world .",
  "opinions": [
    {
      "Source": [
        [],
        []
      ],
      "Target": [
        [
          "students"
        ],
        [
          "17:25"
        ]
      ],
      "Polar_expression": [
        [
          "extremely",
          "competitive"
        ],
        [
          "59:68",
          "69:80"
        ]
      ],
      "Polarity": "Positive",
      "Intensity": "Strong"
    }
  ]
},

Usually the data points are handled as in the second sentence: polar expressions (and the Source and Target fields, for that matter) are split on whitespace, one entry per token, even when the words are directly adjacent. In the first sentence, however, the polar expression is listed as a single string, and its span ("2:33") even covers the target word ("that", "22:26"), although that word does not appear in the string ("can't overemphasize enough").
I'm actually unsure whether this issue stems from the provided preprocessing function or from the underlying dataset.
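
A check along the following lines might help narrow down where the mismatch arises. It is only a minimal sketch, assuming the record layout shown above (each field holds a list of strings followed by a list of "start:end" character offsets into "text"). The function name `find_span_mismatches` is hypothetical and is not part of the repository's preprocessing code.

```python
def find_span_mismatches(record):
    """Return (field, string, span, covered_text) for every annotation whose
    string differs from the text that its character span covers."""
    mismatches = []
    text = record["text"]
    for opinion in record["opinions"]:
        for field in ("Source", "Target", "Polar_expression"):
            strings, spans = opinion[field]
            # zip stops at the shorter list, so strings without a span are
            # silently skipped here; they need a separate length check.
            for string, span in zip(strings, spans):
                start, end = (int(i) for i in span.split(":"))
                covered = text[start:end]
                if covered != string:
                    mismatches.append((field, string, span, covered))
    return mismatches


# The first example from this issue, copied verbatim.
record = {
    "sent_id": "DeVry_University_95_05-16-2004-6",
    "text": "I can't overemphasize that enough .",
    "opinions": [
        {
            "Source": [[], []],
            "Target": [["that"], ["22:26"]],
            "Polar_expression": [["can't overemphasize enough"], ["2:33"]],
            "Polarity": "Positive",
            "Intensity": "Strong",
        }
    ],
}

for field, string, span, covered in find_span_mismatches(record):
    print(f"{field}: {string!r} != text[{span}] = {covered!r}")
```

For this record the check flags only the Polar_expression entry, because the slice for "2:33" contains the extra word "that". Running the same function over the raw Darmstadt files and over the preprocessed JSON could show which side introduces the discrepancy.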

I also noticed that in the following sentence both splitting conventions are applied to Polar_expression:

{
  "sent_id": "Capella_University_50_12-09-2005-3",
  "text": "I have found the course work and research more challenging and of higher quality at Capella than at any of the other institutions I graduated from .",
  "opinions": [
    {
      "Source": [
        [],
        []
      ],
      "Target": [
        [
          "course work research"
        ],
        [
          "17:41"
        ]
      ],
      "Polar_expression": [
        [
          "higher quality"
        ],
        [
          "66:80"
        ]
      ],
      "Polarity": "Positive",
      "Intensity": "Average"
    },
    {
      "Source": [
        [],
        []
      ],
      "Target": [
        [
          "course work research"
        ],
        [
          "17:41"
        ]
      ],
      "Polar_expression": [
        [
          "more",
          "challenging"
        ],
        [
          "42:46",
          "47:58"
        ]
      ],
      "Polarity": "Positive",
      "Intensity": "Strong"
    }
  ]
},

Sometimes the indices for the Polar_expression strings are also missing:

{
  "sent_id": "St_Leo_University_4_04-16-2004-5",
  "text": "The teachers are very helpful , and the staff is , as well .",
  "opinions": [
    {
      "Source": [
        [],
        []
      ],
      "Target": [
        [
          "teachers"
        ],
        [
          "4:12"
        ]
      ],
      "Polar_expression": [
        [
          "very",
          "helpful"
        ],
        []
      ],
      "Polarity": "Positive",
      "Intensity": "Strong"
    },
    {
      "Source": [
        [],
        []
      ],
      "Target": [
        [
          "staff"
        ],
        [
          "40:45"
        ]
      ],
      "Polar_expression": [
        [
          "very",
          "helpful"
        ],
        []
      ],
      "Polarity": "Positive",
      "Intensity": "Strong"
    }
  ]
},
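
Entries like this one could be located by comparing the number of strings with the number of offsets in each field. Again, this is just a sketch under the same layout assumption as the examples in this issue, and `find_missing_offsets` is a hypothetical helper name.

```python
def find_missing_offsets(record):
    """Return (opinion_index, field, n_strings, n_spans) for every field whose
    string list and offset list have different lengths."""
    problems = []
    for i, opinion in enumerate(record["opinions"]):
        for field in ("Source", "Target", "Polar_expression"):
            strings, spans = opinion[field]
            if len(strings) != len(spans):
                problems.append((i, field, len(strings), len(spans)))
    return problems


# The St_Leo example from this issue, copied verbatim.
st_leo = {
    "sent_id": "St_Leo_University_4_04-16-2004-5",
    "text": "The teachers are very helpful , and the staff is , as well .",
    "opinions": [
        {
            "Source": [[], []],
            "Target": [["teachers"], ["4:12"]],
            "Polar_expression": [["very", "helpful"], []],
            "Polarity": "Positive",
            "Intensity": "Strong",
        },
        {
            "Source": [[], []],
            "Target": [["staff"], ["40:45"]],
            "Polar_expression": [["very", "helpful"], []],
            "Polarity": "Positive",
            "Intensity": "Strong",
        },
    ],
}

for index, field, n_strings, n_spans in find_missing_offsets(st_leo):
    print(f"opinion {index}: {field} has {n_strings} strings but {n_spans} offsets")
```

Both opinions in this record are reported, since each lists two polar expression tokens and no offsets.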
