Commit bad5de4

Update data-formats.md
1 parent 3c682d0 commit bad5de4

1 file changed: +58 -30 lines changed

articles/cognitive-services/language-service/custom-named-entity-recognition/concepts/data-formats.md

Lines changed: 58 additions & 30 deletions
@@ -24,48 +24,76 @@ When you tag entities, the tags are saved as in the following JSON format. If yo
 
 ```json
 {
-    //List of entity names. Their index within this array is used as an ID.
-    "entityNames": [
-        "entity_name1",
-        "entity_name2"
+    "extractors": [
+        {
+            "name": "Entity1"
+        },
+        {
+            "name": "Entity2"
+        }
     ],
-    "documents": "path_to_document", //Relative file path to get the text.
-    "culture": "en-US", //Standard culture strings supported by CultureInfo.
-    "entities": [
+    "documents": [
         {
-            "regionStart": 0,
-            "regionLength": 69,
-            "labels": [
+            "location": "file1.txt",
+            "language": "en-us",
+            "extractors": [
                 {
-                    "entity": 0, // Index of the entity in the "entityNames" array. Positions are relative to the original text (not bounding box)
-                    "start": 4,
-                    "length": 10
-                },
+                    "regionOffset": 0,
+                    "regionLength": 5129,
+                    "labels": [
+                        {
+                            "extractorName": "Entity1",
+                            "offset": 77,
+                            "length": 10
+                        },
+                        {
+                            "extractorName": "Entity2",
+                            "offset": 3062,
+                            "length": 8
+                        }
+                    ]
+                }
+            ]
+        },
+        {
+            "location": "file2.txt",
+            "language": "en-us",
+            "extractors": [
                 {
-                    "entity": 1,
-                    "start": 18,
-                    "length": 11
+                    "regionOffset": 0,
+                    "regionLength": 6873,
+                    "labels": [
+                        {
+                            "extractorName": "Entity2",
+                            "offset": 60,
+                            "length": 7
+                        },
+                        {
+                            "extractorName": "Entity1",
+                            "offset": 2805,
+                            "length": 10
+                        }
+                    ]
                 }
             ]
         }
-    ]
+    ]
 }
 ```
 
-The following list describes the various JSON properties of the sample above.
+### Data description
 
-* `entityNames`: An array of entity names. Index of the entity within the array is used as its ID.
+* `extractors`: An array of extractors for your data. Each extractor represents one of the entities you want to extract from your data.
 * `documents`: An array of tagged documents.
-  * `location`: The path of the document relative to the JSON file. For example, docs on the same level as the tags file `file.txt`, for docs inside one directory level `dir1/file.txt`.
-  * `culture`: culture/language of the document. <!-- See [language support](../language-support.md) for more information. -->
-  * `entities`: Specifies the entity recognition tags.
-    * `regionStart`: The inclusive character position of the start of the text.
-    * `regionLength`: The length of the bounding box in terms of UTF16 characters. Training only considers the data in this region, so if this is a tagged file, set the `regionStart` to 0 and the `regionLength` to the last index of last character in the file. You can also set this region if you want to introduce a negative sample to the training, by defining the region as a portion of the file with no tags.
-
-    * `labels`: All tags occurring within the bounding box.
-      * `entity`: The index of the entity in the `entityNames` array.
-      * `start`: The inclusive character position of the start of the tag in the document text. This is not relative to the bounding box.
-      * `length`: The length of the tag in terms of UTF16 characters.
+  * `location`: The path of the file. The file has to be in the root of the storage container.
+  * `language`: Language of the file. Use one of the [supported culture locales](../language-support.md).
+  * `extractors`: Array of extractor objects to be extracted from the file.
+    * `regionOffset`: The inclusive character position of the start of the text.
+    * `regionLength`: The length of the bounding box in terms of UTF-16 characters. Training only considers the data in this region.
+    * `labels`: Array of all the tagged entities within the specified region.
+      * `extractorName`: Type of the entity to be extracted.
+      * `offset`: The inclusive character position of the start of the entity. This is not relative to the bounding box.
+      * `length`: The length of the entity in terms of UTF-16 characters.
 
 ## Next steps

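The new format counts `offset`, `length`, `regionOffset`, and `regionLength` in UTF-16 characters, and `offset` is measured from the start of the document rather than from the region. The following is a minimal Python sketch of how a tags file in this shape might be sanity-checked before training. The helper names `utf16_len`, `utf16_slice`, and `check_labels` are hypothetical and not part of any Azure SDK, and the rule that a label must lie inside its region is an assumption drawn from the description of `labels`, not a documented service check.

```python
def utf16_len(text: str) -> int:
    """Length of text in UTF-16 code units (2 bytes each in UTF-16-LE)."""
    return len(text.encode("utf-16-le")) // 2


def utf16_slice(text: str, offset: int, length: int) -> str:
    """Substring selected by an offset and length given in UTF-16 code units."""
    raw = text.encode("utf-16-le")
    return raw[2 * offset : 2 * (offset + length)].decode("utf-16-le")


def check_labels(tags: dict, texts: dict) -> list:
    """Return a list of problems found in a parsed tags file.

    `texts` maps each document `location` to the text of that file.
    """
    declared = {extractor["name"] for extractor in tags["extractors"]}
    problems = []
    for doc in tags["documents"]:
        where = doc["location"]
        total = utf16_len(texts[where])
        for region in doc["extractors"]:
            region_start = region["regionOffset"]
            region_end = region_start + region["regionLength"]
            if region_end > total:
                problems.append(f"{where}: region ends at {region_end}, past text end {total}")
            for label in region["labels"]:
                name = label["extractorName"]
                start = label["offset"]  # counted from the start of the document
                end = start + label["length"]
                if name not in declared:
                    problems.append(f"{where}: undeclared extractor {name!r}")
                # Assumption: every label must fall inside its region.
                if start < region_start or end > region_end:
                    problems.append(f"{where}: label {name!r} at {start} is outside its region")
    return problems


# Illustrative data only; the emoji occupies two UTF-16 code units.
text = "Pay \U0001F600 Contoso today"
tags = {
    "extractors": [{"name": "Entity1"}],
    "documents": [{
        "location": "file1.txt",
        "language": "en-us",
        "extractors": [{
            "regionOffset": 0,
            "regionLength": utf16_len(text),
            "labels": [{"extractorName": "Entity1", "offset": 7, "length": 7}],
        }],
    }],
}

print(utf16_slice(text, 7, 7))                  # Contoso
print(check_labels(tags, {"file1.txt": text}))  # []
```

Python indexes strings by code point, so `text[7:14]` would not return `Contoso` here. The sketch converts through UTF-16 to get a slice that matches the offsets as the format counts them.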