articles/cognitive-services/language-service/custom-named-entity-recognition/concepts/data-formats.md
58 additions & 30 deletions
Original file line number
Diff line number
Diff line change
@@ -24,48 +24,76 @@ When you tag entities, the tags are saved as in the following JSON format. If yo
24
24
25
25
```json
26
26
{
27
-
//List of entity names. Their index within this array is used as an ID.
28
-
"entityNames": [
29
-
"entity_name1",
30
-
"entity_name2"
27
+
"extractors": [
28
+
{
29
+
"name": "Entity1"
30
+
},
31
+
{
32
+
"name": "Entity2"
33
+
}
31
34
],
32
-
"documents": "path_to_document", //Relative file path to get the text.
33
-
"culture": "en-US", //Standard culture strings supported by CultureInfo.
34
-
"entities": [
35
+
"documents": [
35
36
{
36
-
"regionStart": 0,
37
-
"regionLength": 69,
38
-
"labels": [
37
+
"location": "file1.txt",
38
+
"language": "en-us",
39
+
"extractors": [
39
40
{
40
-
"entity": 0, // Index of the entity in the "entityNames" array. Positions are relative to the original text (not bounding box)
41
-
"start": 4,
42
-
"length": 10
43
-
},
41
+
"regionOffset": 0,
42
+
"regionLength": 5129,
43
+
"labels": [
44
+
{
45
+
"extractorName": "Entity1",
46
+
"offset": 77,
47
+
"length": 10
48
+
},
49
+
{
50
+
"extractorName": "Entity2",
51
+
"offset": 3062,
52
+
"length": 8
53
+
}
54
+
]
55
+
}
56
+
]
57
+
},
58
+
{
59
+
"location": "file2.txt",
60
+
"language": "en-us",
61
+
"extractors": [
44
62
{
45
-
"entity": 1,
46
-
"start": 18,
47
-
"length": 11
63
+
"regionOffset": 0,
64
+
"regionLength": 6873,
65
+
"labels": [
66
+
{
67
+
"extractorName": "Entity2",
68
+
"offset": 60,
69
+
"length": 7
70
+
},
71
+
{
72
+
"extractorName": "Entity1",
73
+
"offset": 2805,
74
+
"length": 10
75
+
}
76
+
]
48
77
}
49
78
]
50
79
}
51
-
]
80
+
]
52
81
}
53
82
```
54
83
55
-
The following list describes the various JSON properties of the sample above.
84
+
### Data description
56
85
57
-
*`entityNames`: An array of entity names. Index of the entity within the array is used as its ID.
86
+
* `extractors`: An array of extractors for your data. Each extractor represents one of the entities you want to extract from your data.
58
87
* `documents`: An array of tagged documents.
59
-
*`location`: The path of the document relative to the JSON file. For example, docs on the same level as the tags file `file.txt`, for docs inside one directory level `dir1/file.txt`.
60
-
*`culture`: culture/language of the document. <!-- See [language support](../language-support.md) for more information. -->
61
-
*`entities`: Specifies the entity recognition tags.
62
-
*`regionStart`: The inclusive character position of the start of the text.
63
-
*`regionLength`: The length of the bounding box in terms of UTF16 characters. Training only considers the data in this region, so if this is a tagged file, set the `regionStart` to 0 and the `regionLength` to the last index of last character in the file. You can also set this region if you want to introduce a negative sample to the training, by defining the region as a portion of the file with no tags.
64
-
65
-
*`labels`: All tags occurring within the bounding box.
66
-
*`entity`: The index of the entity in the `entityNames` array.
67
-
*`start`: The inclusive character position of the start of the tag in the document text. This is not relative to the bounding box.
68
-
*`length`: The length of the tag in terms of UTF16 characters.
88
+
* `location`: The path of the file. The file has to be in the root of the storage container.
89
+
* `language`: Language of the file. Use one of the [supported culture locales](../language-support.md).
90
+
* `extractors`: Array of extractor objects to be extracted from the file.
91
+
* `regionOffset`: The inclusive character position of the start of the text.
92
+
* `regionLength`: The length of the bounding box in terms of UTF-16 characters. Training only considers the data in this region.
93
+
* `labels`: Array of all the tagged entities within the specified region.
94
+
* `extractorName`: Type of the entity to be extracted.
95
+
* `offset`: The inclusive character position of the start of the entity. This is not relative to the bounding box.
96
+
* `length`: The length of the entity in terms of UTF-16 characters.
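
As a rough illustration of how the `offset` and `length` values in the new format map back to text, the following Python sketch walks a parsed tags file and recovers each labeled span. It is not part of the service or its SDK. The helper names `utf16_slice` and `iter_labels` and the `read_text` callback are hypothetical, and the sketch assumes that `offset` is counted in the same UTF-16 units as `length` and is measured from the start of the file, as the property descriptions suggest.

```python
def utf16_slice(text, offset, length):
    """Return the substring covering `length` UTF-16 code units from `offset`.

    Python strings index by code point, so the text is encoded to UTF-16
    first (2 bytes per code unit) and sliced at byte level.
    """
    data = text.encode("utf-16-le")
    return data[2 * offset : 2 * (offset + length)].decode("utf-16-le")


def iter_labels(tags, read_text):
    """Yield (location, extractorName, labeled text) for every label.

    `tags` is the parsed JSON object; `read_text` is a caller-supplied
    (hypothetical) function that returns the text of a document given
    its `location`.
    """
    for document in tags["documents"]:
        text = read_text(document["location"])
        for region in document["extractors"]:
            for label in region["labels"]:
                # Offsets are treated as relative to the whole file,
                # not to the region (an assumption based on the docs).
                yield (
                    document["location"],
                    label["extractorName"],
                    utf16_slice(text, label["offset"], label["length"]),
                )


# Small in-memory example; the emoji occupies two UTF-16 code units.
sample_text = "😀 Contoso hired Alice."
sample_tags = {
    "extractors": [{"name": "Entity1"}, {"name": "Entity2"}],
    "documents": [
        {
            "location": "file1.txt",
            "language": "en-us",
            "extractors": [
                {
                    "regionOffset": 0,
                    "regionLength": 23,
                    "labels": [
                        {"extractorName": "Entity1", "offset": 3, "length": 7},
                        {"extractorName": "Entity2", "offset": 17, "length": 5},
                    ],
                }
            ],
        }
    ],
}

for row in iter_labels(sample_tags, lambda location: sample_text):
    print(row)
```

Slicing the encoded bytes rather than the Python string keeps the positions correct when the text contains characters outside the Basic Multilingual Plane, which take two UTF-16 units each.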