Commit 7ca2b61

Improve README
1 parent d70d57f commit 7ca2b61

1 file changed

+68
-7
lines changed

README.md

Lines changed: 68 additions & 7 deletions
@@ -276,6 +276,7 @@ like phone numbers, but that are in fact invalid, will NOT be detected as the Se
276276
* Assuming the entire set of stream names is available, Semantic Type detection of a particular column may be impacted by other stream names. For example, the Semantic Type PERSON.AGE is detected if another field of type GENDER or NAME.FIRST is detected.
277277
* When using Record mode for Semantic Type analysis, the detection of Semantic Types for a stream may be impacted by the prior determination of the Semantic Type of another stream (either via detection or provided with the Context).
278278
* By default analysis is performed on the initial 4096 characters of the field (adjustable via setMaxInputLength()).
279+
* If two Semantic Types have equal confidence then the Semantic Type with the highest priority will be selected.
279280

280281
[Details of Semantic Types detected](SemanticTypes.md)
281282

@@ -295,9 +296,9 @@ F1-Score == 2 * ((Precision * Recall) / (Precision + Recall))
295296
### Additional user-defined Semantic Types ###
296297

297298
Additional Semantic types can be detected by registering additional plugins (see registerPlugins). There are three basic types of plugins:
298-
* Code - captures any complex type (e.g. Even numbers, Credit Cards numbers). Implemented via a Java Class.
299-
* Finite - captures any finite type (e.g. ISO-3166-2 (Country codes), US States, ...). Implemented via a supplied list with the valid elements enumerated.
300-
* RegExp - captures any type that can be expressed via a Regular Expression (e.g. SSN). Implemented via a set of Regular Expressions used to match against.
299+
* RegExp (regex) - captures any type that can be expressed via a Regular Expression (e.g. SSN). Implemented via a set of Regular Expressions used to match against.
300+
* Finite (list) - captures any finite type (e.g. ISO-3166-2 (Country codes), US States, ...). Implemented via a supplied list with the valid elements enumerated.
301+
* Code (java) - captures any complex type (e.g. even numbers, Credit Card numbers). Implemented via a Java Class.
301302

302303
Note: The Context (the current Stream Name and other field names) can be used to bias detection of the incoming data and/or solely determine the detection.
303304

@@ -373,7 +374,7 @@ In all cases the plugin definition and locale are passed as arguments.
373374

374375
The mandatory 'semanticType' tag is the name of this Semantic Type.
375376

376-
The 'threshold' tag is the percentage of valid samples required by this plugin to establish the Stream Data as a a valid instance of this Semantic Type.
377+
The 'threshold' tag is the percentage confidence we require to establish the Stream Data as a valid instance of this Semantic Type. In the simplest case this can be the percentage of samples detected as valid in the provided stream. More commonly the confidence is determined by a combination of the header confidence and the observed data.
377378
The threshold will default to 95% if not specified.
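
The simplest case described above (confidence as the percentage of valid samples) can be sketched as follows. This is only an illustration: `ThresholdCheck` and its methods are hypothetical names, not part of the FTA API, and treating a confidence exactly equal to the threshold as passing is an assumption.

```java
/** Hypothetical helper for illustration; not part of the FTA API. */
final class ThresholdCheck {
    /** Percentage of samples judged valid; 0 when there are no samples. */
    static double confidencePercent(long validSamples, long totalSamples) {
        if (totalSamples <= 0) {
            return 0.0;
        }
        return 100.0 * validSamples / totalSamples;
    }

    /** Assumption: a confidence equal to the threshold is accepted. */
    static boolean meetsThreshold(long validSamples, long totalSamples, int thresholdPercent) {
        return confidencePercent(validSamples, totalSamples) >= thresholdPercent;
    }

    public static void main(String[] args) {
        // 96 valid samples out of 100, checked against the default (95) and a stricter (98) threshold.
        System.out.println(meetsThreshold(96, 100, 95));
        System.out.println(meetsThreshold(96, 100, 98));
    }
}
```

The sketch ignores header confidence, which the text notes is more commonly combined with the observed data.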
378379

379380
The 'baseType' tag constrains the plugin to streams that are of this Base Type (see discussion above on the valid Base Types).
@@ -399,13 +400,73 @@ The optional 'minSamples' tag indicates that in order for this Semantic Type to
399400

400401
The optional 'invalidList' tag is a list of invalid values for this Semantic Type, for example '[ "000-00-0000" ]' indicates that this is an invalid SSN, despite the fact that it matches the SSN regular expression.
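
A minimal sketch of the idea behind 'invalidList', assuming a value must match the regular expression and also be absent from the list. The class name and the SSN pattern used here are illustrative assumptions, not the expression FTA actually uses.

```java
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical illustration of an invalid list; not FTA code. */
final class InvalidListCheck {
    // Assumed SSN shape for illustration only.
    private static final Pattern SSN = Pattern.compile("\\d{3}-\\d{2}-\\d{4}");

    // Values that match the pattern but must still be rejected.
    private static final Set<String> INVALID = Set.of("000-00-0000");

    static boolean isValid(String input) {
        return SSN.matcher(input).matches() && !INVALID.contains(input);
    }

    public static void main(String[] args) {
        System.out.println(isValid("123-45-6789"));
        System.out.println(isValid("000-00-0000"));
    }
}
```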
401402

403+
#### Example
404+
The following example detects Indian Postal Codes. The header is mandatory, so detection requires both data matching the regular expression '\d{6}' and a case-insensitive header match on 'pincode'. The plugin returns '[1-9]\d{5}' as the regular expression, because an Indian Postal Code cannot start with a zero.
405+
406+
```json
{
    "semanticType": "POSTAL_CODE.POSTAL_CODE_IN",
    "description": "Postal Code (IN)",
    "pluginType": "regex",
    "validLocales": [
        {
            "localeTag": "en-IN,hi-IN",
            "headerRegExps": [ { "regExp": ".*(?i)(?u)(pincode).*", "confidence": 95, "mandatory": true } ],
            "matchEntries": [ {
                "regExpsToMatch": [ "\\d{6}" ],
                "regExpReturned": "[1-9]\\d{5}"
            } ]
        }
    ],
    "threshold": 98
}
```
424+
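
To see what the three expressions in this definition accept, the sketch below compiles them with `java.util.regex`. It only exercises the patterns themselves; how FTA combines them internally is not shown, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

/** Hypothetical illustration of the patterns in the plugin definition; not FTA code. */
final class IndianPostalCodeSketch {
    // Patterns copied from the JSON above (JSON escaping removed, Java escaping added).
    static final Pattern HEADER = Pattern.compile(".*(?i)(?u)(pincode).*");
    static final Pattern TO_MATCH = Pattern.compile("\\d{6}");
    static final Pattern RETURNED = Pattern.compile("[1-9]\\d{5}");

    static boolean headerMatches(String header) {
        return HEADER.matcher(header).matches();
    }

    static boolean sampleMatches(String sample) {
        return TO_MATCH.matcher(sample).matches();
    }

    static boolean satisfiesReturned(String sample) {
        return RETURNED.matcher(sample).matches();
    }

    public static void main(String[] args) {
        System.out.println(headerMatches("Customer PinCode"));
        System.out.println(sampleMatches("012345"));
        System.out.println(satisfiesReturned("012345"));
    }
}
```

Note that '012345' satisfies the looser match expression but not the stricter returned expression, which is the distinction the example is drawing.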
402425
### Finite plugins ###
403426

404427
The 'content' element is mandatory.
405428

406429
The 'type' tag determines how the content is provided (possible values are 'inline', 'resource', or 'file').
407430
If the type is 'inline' then the tag 'members' is the array of possible values. If the type is 'resource' or 'file' then the tag 'reference' is the file/resource that contains the list of values. Note: the list of possible values is required to be upper case and encoded in UTF-8.
408431

432+
#### Example
433+
```json
{
    "semanticType": "CUSTOM.ELEMENTS",
    "description": "Periodic Table Elements",
    "pluginType": "list",
    "validLocales": [ {
        "localeTag": "en"
    } ],
    "threshold": 95,
    "content": {
        "type": "resource",
        "reference": "/elements.csv"
    },
    "documentation": [
        { "source": "wikipedia", "reference": "https://en.wikipedia.org/wiki/Periodic_table" }
    ],
    "backout": "\\\\p{IsAlphabetic}{1,2}"
}
```
452+
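
The membership test implied by a Finite plugin can be sketched as below. Because the member list is required to be upper case, the sketch assumes input is trimmed and upper-cased before lookup; that normalization, like the class name, is an assumption for illustration rather than documented FTA behavior.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical illustration of a finite (list) type; not FTA code. */
final class FiniteMembersSketch {
    // Members are stored exactly as supplied and are expected to be upper case already.
    private final Set<String> members;

    FiniteMembersSketch(String... upperCaseMembers) {
        this.members = new HashSet<>(Arrays.asList(upperCaseMembers));
    }

    /** Assumption: input is normalized (trimmed, upper-cased) before the lookup. */
    boolean isMember(String input) {
        return input != null && members.contains(input.trim().toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        FiniteMembersSketch elements = new FiniteMembersSketch("HYDROGEN", "HELIUM", "LITHIUM");
        System.out.println(elements.isMember("helium"));
        System.out.println(elements.isMember("Unobtainium"));
    }
}
```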
453+
### Code plugins ###
454+
455+
Code plugins are implemented via a Java class. This class will typically either extend LogicalTypeInfinite for types with a large number of members, or extend LogicalTypeFinite for a type with a finite number of members. For a simple example refer to the code to detect IPv4 addresses (IPV4Address.java) or the sample PluginColor.java.
456+
457+
#### Key methods
458+
isCandidate() - Fast check to see if the input might be an instance of this Semantic type.
459+
460+
isValid() - Is the supplied input an instance of this Semantic type?
461+
462+
getRegExp() - The Regular Expression that most closely matches this Semantic Type.
463+
464+
getConfidence() - Will default to the number of valid samples / size of the sample set. This is commonly overridden to bias the confidence based on the field name.
465+
466+
nextRandom() - Will generate a random (secure) valid example of this Semantic Type.
467+
468+
analyzeSet() - Given the data set analyzed determine if this set is likely an instance of this Semantic Type.
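
To make the roles of these methods concrete, here is a stand-alone sketch for an "even number" type. It deliberately does not extend LogicalTypeInfinite, and every signature in it is a simplified assumption, so treat it as an outline of responsibilities rather than a working plugin (refer to PluginColor.java for the real shape). analyzeSet() is omitted.

```java
import java.security.SecureRandom;

/** Stand-alone sketch; does NOT extend FTA classes, and signatures are simplified assumptions. */
final class EvenNumberSketch {
    private final SecureRandom random = new SecureRandom();

    /** Fast check: an optional leading '-' followed by ASCII digits only. */
    boolean isCandidate(String input) {
        if (input == null || input.isEmpty()) {
            return false;
        }
        int start = input.charAt(0) == '-' ? 1 : 0;
        if (start == input.length()) {
            return false;
        }
        for (int i = start; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }

    /** Full check: a candidate whose final digit is even. */
    boolean isValid(String input) {
        if (!isCandidate(input)) {
            return false;
        }
        char last = input.charAt(input.length() - 1);
        return (last - '0') % 2 == 0;
    }

    /** Regular expression that most closely describes the type. */
    String getRegExp() {
        return "-?\\d*[02468]";
    }

    /** Default-style confidence: valid samples divided by the size of the sample set. */
    double getConfidence(long validSamples, long totalSamples) {
        return totalSamples == 0 ? 0.0 : (double) validSamples / totalSamples;
    }

    /** Random valid example, generated with a secure random source. */
    String nextRandom() {
        return String.valueOf(2L * random.nextInt(500_000));
    }

    public static void main(String[] args) {
        EvenNumberSketch type = new EvenNumberSketch();
        System.out.println(type.isValid("42"));
        System.out.println(type.isValid(type.nextRandom()));
    }
}
```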
469+
409470
## Invalid Set ##
410471

411472
An invalid entry is one that is not valid for the detected type and/or Semantic type.
@@ -453,7 +514,7 @@ The DataSignature will be identical for AccountLocation and PrimaryCountry as th
453514

454515
Additional attributes captured in JSON structure:
455516
- Included if statistics are enabled: min, max, mean, standardDeviation, topK, bottomK
456-
- Included if Base Type == Double: decimalSeparator
517+
- Included if Base Type is Double: decimalSeparator
457518
- Included if Base Type is Numeric: leadingZeroCount
458519
- Included if Base Type is Date: dateResolutionMode
459520

@@ -536,11 +597,11 @@ Within the specification the type is required and can either be a Semantic Type
536597
- format - the format for outputting this field (e.g. %03d for a LONG)
537598
- distribution - the distribution of the samples (gaussian, monotonic_increasing, monotonic_decreasing; the default is normal)
538599
- nullPercent - the percentage of nulls in this field
539-
- blankPerent - the percentage of blanks in this field
600+
- blankPercent - the percentage of blanks in this field
540601
- values - for a STRING type, the possible set of values can be specified
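
As a small illustration of the 'format' attribute, the sketch below assumes the format string follows `java.util.Formatter` syntax (which '%03d' resembles). The class name is hypothetical.

```java
/** Hypothetical illustration of a '%03d' style format applied to a LONG value. */
final class FormatSketch {
    static String render(String format, long value) {
        // %03d pads with leading zeros to a minimum width of three digits.
        return String.format(format, value);
    }

    public static void main(String[] args) {
        System.out.println(render("%03d", 7L));
    }
}
```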
541602

542603
## Merging Analyses ##
543-
FTA supports merging of analyses run on distinct data shards. So for example, if part of the data to be profiled resides on one shard and the balance on a separate shard then FTA can be invoked on each shard separately and then merged. To accomplish this individual analyses should be executed (with similar configurations), the resulting serialized forms should then be deserialized on a common node and merged. Refer to the Merge example for further details.
604+
FTA supports merging of analyses run on distinct data shards. So for example, if part of the data to be profiled resides on one shard and the balance on a separate shard then FTA can be invoked on each shard separately and then merged. To accomplish this, individual analyses should be executed (with similar configurations), the resulting serialized forms should then be deserialized on a common node and merged. Refer to the Merge example for further details.
544605

545606
The accuracy of the merge is determined by the cardinality of the two individual shards, and falls into one of the following three cases:
546607
- cardinality(one) + cardinality(two) < max cardinality
