Description
Hi!
I've just found an inconsistency in the Darmstadt dataset dev split. I haven't checked whether this also occurs in different datasets or in different splits.
Two back-to-back examples in the dev split look like this:
{
"sent_id": "DeVry_University_95_05-16-2004-6",
"text": "I can't overemphasize that enough .",
"opinions": [
{
"Source": [
[],
[]
],
"Target": [
[
"that"
],
[
"22:26"
]
],
"Polar_expression": [
[
"can't overemphasize enough"
],
[
"2:33"
]
],
"Polarity": "Positive",
"Intensity": "Strong"
}
]
},
{
"sent_id": "DeVry_University_95_05-16-2004-7",
"text": "The school gives students a knowledge base that makes them extremely competitive in the corporate world .",
"opinions": [
{
"Source": [
[],
[]
],
"Target": [
[
"students"
],
[
"17:25"
]
],
"Polar_expression": [
[
"extremely",
"competitive"
],
[
"59:68",
"69:80"
]
],
"Polarity": "Positive",
"Intensity": "Strong"
}
]
},
Usually the data points are handled as in the second sentence: polar expressions (as well as the Source and Target fields, for that matter) are split on whitespace, even when the words are directly adjacent. In the first sentence, however, the polar expression is listed as a single string, and its span ("2:33") even covers the target word ("that", "22:26"), although that word does not appear in the string ("can't overemphasize enough").
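To make the mismatch concrete, here is a minimal sketch that slices the first sentence with the spans listed above. `slice_span` is a hypothetical helper written for this illustration, not part of the repository's preprocessing code, and it assumes the spans are `start:end` character offsets with an exclusive end.

```python
# Text and spans copied from the DeVry_University_95_05-16-2004-6 example.
text = "I can't overemphasize that enough ."


def slice_span(text, span):
    """Hypothetical helper: return the substring for a 'start:end' span."""
    start, end = map(int, span.split(":"))
    return text[start:end]


polar_slice = slice_span(text, "2:33")
target_slice = slice_span(text, "22:26")

print(polar_slice)   # can't overemphasize that enough
print(target_slice)  # that

# The stored string drops the target word, so it differs from the slice.
print(polar_slice == "can't overemphasize enough")  # False
```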
I'm actually unsure whether this issue stems from the provided preprocessing function or from the underlying dataset.
I also noticed that in the following sentence both conventions are applied to Polar_expression, one per opinion:
{
"sent_id": "Capella_University_50_12-09-2005-3",
"text": "I have found the course work and research more challenging and of higher quality at Capella than at any of the other institutions I graduated from .",
"opinions": [
{
"Source": [
[],
[]
],
"Target": [
[
"course work research"
],
[
"17:41"
]
],
"Polar_expression": [
[
"higher quality"
],
[
"66:80"
]
],
"Polarity": "Positive",
"Intensity": "Average"
},
{
"Source": [
[],
[]
],
"Target": [
[
"course work research"
],
[
"17:41"
]
],
"Polar_expression": [
[
"more",
"challenging"
],
[
"42:46",
"47:58"
]
],
"Polarity": "Positive",
"Intensity": "Strong"
}
]
},
Sometimes the offsets for the Polar_expression strings are also missing:
{
"sent_id": "St_Leo_University_4_04-16-2004-5",
"text": "The teachers are very helpful , and the staff is , as well .",
"opinions": [
{
"Source": [
[],
[]
],
"Target": [
[
"teachers"
],
[
"4:12"
]
],
"Polar_expression": [
[
"very",
"helpful"
],
[]
],
"Polarity": "Positive",
"Intensity": "Strong"
},
{
"Source": [
[],
[]
],
"Target": [
[
"staff"
],
[
"40:45"
]
],
"Polar_expression": [
[
"very",
"helpful"
],
[]
],
"Polarity": "Positive",
"Intensity": "Strong"
}
]
},
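In case it helps with triage, below is a rough sketch of how the three patterns above could be flagged automatically. All function names and problem labels are my own (hypothetical), and the code assumes each field is a `[strings, offsets]` pair with `start:end` character offsets, as in the examples in this issue.

```python
def check_field(text, field):
    """Hypothetical check: list problem labels for one [strings, offsets] pair."""
    strings, offsets = field
    if strings and not offsets:
        return ["missing_offsets"]
    if len(strings) != len(offsets):
        return ["length_mismatch"]
    problems = []
    for string, offset in zip(strings, offsets):
        start, end = map(int, offset.split(":"))
        if text[start:end] != string:
            # Stored string is not the literal substring at the given span.
            problems.append("string_span_mismatch")
        if " " in string:
            # Multi-word string that was not split on whitespace.
            problems.append("unsplit_multiword")
    return problems


def check_sentence(sentence):
    """Return (sent_id, opinion index, field name, label) for every problem."""
    found = []
    for index, opinion in enumerate(sentence["opinions"]):
        for name in ("Source", "Target", "Polar_expression"):
            for label in check_field(sentence["text"], opinion[name]):
                found.append((sentence["sent_id"], index, name, label))
    return found


# Trimmed copy of the first example from this issue.
sample = {
    "sent_id": "DeVry_University_95_05-16-2004-6",
    "text": "I can't overemphasize that enough .",
    "opinions": [
        {
            "Source": [[], []],
            "Target": [["that"], ["22:26"]],
            "Polar_expression": [["can't overemphasize enough"], ["2:33"]],
        }
    ],
}

for problem in check_sentence(sample):
    print(problem)
```

Running `check_sentence` over every sentence of a split (assuming the split file is a JSON list of such objects) should show whether the problem is limited to a few entries or is systematic.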