README.md: 68 additions & 7 deletions
@@ -276,6 +276,7 @@ like phone numbers, but that are in fact invalid, will NOT be detected as the Se
276
276
* Assuming the entire set of stream names is available, Semantic Type detection of a particular column may be impacted by other stream names, for example the Semantic Type PERSON.AGE is detected if we detect another field of type GENDER or NAME.FIRST.
277
277
* When using Record mode for Semantic Type analysis, the detection of Semantic Types for a stream may be impacted by the prior determination of the Semantic Type of another stream (either via detection or provided with the Context).
278
278
* By default analysis is performed on the initial 4096 characters of the field (adjustable via setMaxInputLength()).
279
+
* If two Semantic Types have equal confidence then the Semantic Type with the highest priority will be selected.
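
The tie-break rule in the last bullet can be sketched as follows. This is a minimal illustration, not FTA's implementation: the `Candidate` class, its fields, and the assumption that a larger number means a higher priority are all hypothetical.

```java
import java.util.Comparator;
import java.util.List;

final class TieBreakSketch {
    // Hypothetical candidate: a Semantic Type name, its confidence, and a priority.
    static final class Candidate {
        final String semanticType;
        final double confidence;
        final int priority; // ASSUMPTION: a larger value means a higher priority

        Candidate(String semanticType, double confidence, int priority) {
            this.semanticType = semanticType;
            this.confidence = confidence;
            this.priority = priority;
        }
    }

    // Pick the candidate with the greatest confidence; on equal confidence,
    // fall back to the candidate with the higher priority.
    static Candidate select(List<Candidate> candidates) {
        return candidates.stream()
            .max(Comparator.comparingDouble((Candidate c) -> c.confidence)
                .thenComparingInt(c -> c.priority))
            .orElseThrow(() -> new IllegalArgumentException("no candidates"));
    }
}
```

Confidence is compared first, so priority only matters when two candidates are exactly tied.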
279
280
280
281
[Details of Semantic Types detected](SemanticTypes.md)
Additional Semantic types can be detected by registering additional plugins (see registerPlugins). There are three basic types of plugins:
298
-
*Code - captures any complex type (e.g. Even numbers, Credit Cards numbers). Implemented via a Java Class.
299
-
* Finite - captures any finite type (e.g. ISO-3166-2 (Country codes), US States, ...). Implemented via a supplied list with the valid elements enumerated.
300
-
*RegExp - captures any type that can be expressed via a Regular Expression (e.g. SSN). Implemented via a set of Regular Expressions used to match against.
299
+
* RegExp (regex) - captures any type that can be expressed via a Regular Expression (e.g. SSN). Implemented via a set of Regular Expressions that the input is matched against.
300
+
* Finite (list) - captures any finite type (e.g. ISO-3166-2 (Country codes), US States, ...). Implemented via a supplied list with the valid elements enumerated.
301
+
* Code (java) - captures any complex type (e.g. even numbers, credit card numbers). Implemented via a Java class.
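
To make the distinction between the three plugin styles concrete, here is a minimal sketch using only the JDK. It does not use the FTA plugin API; the class name, the simplified SSN shape, and the three-state sample list are assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.Predicate;
import java.util.regex.Pattern;

final class PluginStylesSketch {
    // RegExp style: validity is decided by a regular expression (simplified SSN shape).
    static final Pattern SSN_SHAPE = Pattern.compile("\\d{3}-\\d{2}-\\d{4}");
    static final Predicate<String> REGEXP = s -> SSN_SHAPE.matcher(s).matches();

    // Finite style: validity is decided by membership in an enumerated list
    // (a hypothetical three-entry sample, not the full list of US States).
    static final Set<String> SOME_STATES = new HashSet<>(Arrays.asList("CA", "NY", "TX"));
    static final Predicate<String> FINITE =
        s -> SOME_STATES.contains(s.trim().toUpperCase(Locale.ROOT));

    // Code style: validity needs logic that a regex or a list cannot express compactly.
    static boolean isEven(String s) {
        try {
            return Long.parseLong(s.trim()) % 2 == 0;
        } catch (NumberFormatException e) {
            return false; // not a number at all
        }
    }
}
```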
301
302
302
303
Note: The Context (the current Stream Name and other field names) can be used to bias detection of the incoming data and/or solely determine the detection.
303
304
@@ -373,7 +374,7 @@ In all cases the plugin definition and locale are passed as arguments.
373
374
374
375
The mandatory 'semanticType' tag is the name of this Semantic Type.
375
376
376
-
The 'threshold' tag is the percentage of valid samples required by this plugin to establish the Stream Data as a a valid instance of this Semantic Type.
377
+
The 'threshold' tag is the percentage confidence we require to establish the Stream Data as a valid instance of this Semantic Type. In the simplest case this can be the percentage of samples detected as valid in the provided stream. More commonly the confidence is determined by a combination of the header confidence and the observed data.
377
378
The threshold will default to 95% if not specified.
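
The simplest case described above (confidence as the percentage of valid samples) can be sketched as below. The class and method names are hypothetical, the inclusive comparison is an assumption, and the header contribution mentioned above is deliberately left out.

```java
final class ThresholdSketch {
    // Default threshold (percent) as stated in the text above.
    static final int DEFAULT_THRESHOLD = 95;

    // Simplest case only: confidence is the share of valid samples.
    // Integer arithmetic avoids floating-point rounding at the boundary.
    // ASSUMPTION: meeting the threshold exactly counts as a pass.
    static boolean meetsThreshold(long validSamples, long totalSamples, int thresholdPercent) {
        if (totalSamples <= 0) {
            return false; // nothing observed, nothing to establish
        }
        return validSamples * 100 >= (long) thresholdPercent * totalSamples;
    }
}
```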
378
379
379
380
The 'baseType' tag constrains the plugin to streams that are of this Base Type (see discussion above on the valid Base Types).
@@ -399,13 +400,73 @@ The optional 'minSamples' tag indicates that in order for this Semantic Type to
399
400
400
401
The optional 'invalidList' tag is a list of invalid values for this Semantic Type, for example '[ "000-00-0000" ]' indicates that this is an invalid SSN, despite the fact that it matches the SSN regular expression.
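
The effect of an 'invalidList' can be sketched as a second check applied after the regular expression match. This is an illustration in plain Java, not the FTA implementation; the class name and the simplified SSN pattern are assumptions.

```java
import java.util.Collections;
import java.util.Set;
import java.util.regex.Pattern;

final class InvalidListSketch {
    // Simplified SSN shape (assumption; the real plugin's expression may differ).
    static final Pattern SSN = Pattern.compile("\\d{3}-\\d{2}-\\d{4}");

    // Values that match the pattern but are nonetheless invalid.
    static final Set<String> INVALID_LIST = Collections.singleton("000-00-0000");

    static boolean isValid(String input) {
        return SSN.matcher(input).matches() && !INVALID_LIST.contains(input);
    }
}
```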
401
402
403
+
#### Example
404
+
The following example looks for an Indian Postal Code. In this case the header is mandatory, so we insist on both a match against a regular expression of the form '\d{6}' and a case-independent match for the header. The plugin will return '[1-9]\\d{5}' because a leading zero is illegal in an Indian Postal Code.
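
Separately from the plugin definition, the two conditions described above (a header match plus a data match) can be checked in plain Java as a rough sketch. The header pattern below is purely hypothetical and is not taken from the plugin; only the data expression '[1-9]\d{5}' comes from the text.

```java
import java.util.regex.Pattern;

final class IndianPostalCodeSketch {
    // Data expression from the text: six digits with no leading zero.
    static final Pattern DATA = Pattern.compile("[1-9]\\d{5}");

    // ASSUMPTION: a hypothetical header expression, matched case-independently.
    static final Pattern HEADER =
        Pattern.compile(".*(pin|postal).*", Pattern.CASE_INSENSITIVE);

    // Both the header and the value must match.
    static boolean matches(String header, String value) {
        return HEADER.matcher(header).matches() && DATA.matcher(value).matches();
    }
}
```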
The 'type' tag determines how the content is provided (possible values are 'inline', 'resource', or 'file').
407
430
If the type is 'inline' then the tag 'members' is the array of possible values. If the type is 'resource' or 'file' then the tag 'reference' is the file/resource that contains the list of values. Note: the list of possible values is required to be upper case and encoded in UTF-8.
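
As a rough sketch of the upper-case requirement, the helper below reads a list of values and rejects any entry that is not upper case. It is not part of FTA: the class name, the skipping of blank lines, and the choice to throw on a bad entry are assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class FiniteListLoaderSketch {
    // Reads one value per line and enforces the upper-case requirement.
    static Set<String> load(Reader source) {
        Set<String> members = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String value = line.trim();
                if (value.isEmpty()) {
                    continue; // ASSUMPTION: blank lines are ignored
                }
                if (!value.equals(value.toUpperCase(Locale.ROOT))) {
                    throw new IllegalArgumentException("List entry is not upper case: " + value);
                }
                members.add(value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return members;
    }
}
```

For a file on disk, the reader would be opened with an explicit charset, for example `Files.newBufferedReader(path, StandardCharsets.UTF_8)`, so that the UTF-8 requirement is honored regardless of the platform default.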
Code plugins are implemented via a Java class. This class will typically either extend LogicalTypeInfinite for types with a large number of members, or extend LogicalTypeFinite for a type with a finite number of members. For a simple example refer to the code to detect IPv4 addresses (IPV4Address.java) or the sample PluginColor.java.
456
+
457
+
#### Key methods
458
+
isCandidate() - Fast check to see if the input might be an instance of this Semantic type.
459
+
460
+
isValid() - Is the supplied input an instance of this Semantic type?
461
+
462
+
getRegExp() - The Regular Expression that most closely matches this Semantic Type.
463
+
464
+
getConfidence() - Will default to the number of valid samples / size of the sample set. This is commonly overridden to bias the confidence based on the field name.
465
+
466
+
nextRandom() - Will generate a random (secure) valid example of this Semantic Type.
467
+
468
+
analyzeSet() - Given the data set analyzed, determine whether this set is likely an instance of this Semantic Type.
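
The roles of these methods can be illustrated with a standalone sketch for an "even number" type. It deliberately does not extend LogicalTypeInfinite, and the simplified signatures below are assumptions rather than the FTA API; analyzeSet() is omitted because its inputs are not described here.

```java
import java.security.SecureRandom;
import java.util.Locale;

final class EvenNumberSketch {
    private final SecureRandom random = new SecureRandom();

    // Fast check: only inspect the last character.
    boolean isCandidate(String trimmed) {
        return !trimmed.isEmpty()
            && "02468".indexOf(trimmed.charAt(trimmed.length() - 1)) >= 0;
    }

    // Full check: the whole input must be digits ending in an even digit.
    boolean isValid(String input) {
        return input.matches(getRegExp());
    }

    // The regular expression that most closely describes the type.
    String getRegExp() {
        return "\\d*[02468]";
    }

    // Default is valid samples / sample size, biased by the field name.
    // ASSUMPTION: the 0.05 boost for a field name containing "even" is hypothetical.
    double getConfidence(long matchCount, long realSamples, String fieldName) {
        if (realSamples == 0) {
            return 0.0;
        }
        double confidence = (double) matchCount / realSamples;
        if (fieldName != null && fieldName.toLowerCase(Locale.ROOT).contains("even")) {
            confidence = Math.min(1.0, confidence + 0.05);
        }
        return confidence;
    }

    // A random valid example, generated from a secure source.
    String nextRandom() {
        return String.valueOf(2L * random.nextInt(1_000_000));
    }
}
```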
469
+
409
470
## Invalid Set ##
410
471
411
472
An invalid entry is one that is not valid for the detected type and/or Semantic type.
@@ -453,7 +514,7 @@ The DataSignature will be identical for AccountLocation and PrimaryCountry as th
453
514
454
515
Additional attributes captured in JSON structure:
455
516
- Included if statistics are enabled: min, max, mean, standardDeviation, topK, bottomK
456
-
- Included if Base Type == Double: decimalSeparator
517
+
- Included if Base Type is Double: decimalSeparator
457
518
- Included if Base Type is Numeric: leadingZeroCount
458
519
- Included if Base Type is Date: dateResolutionMode
459
520
@@ -536,11 +597,11 @@ Within the specification the type is required and can either be a Semantic Type
536
597
- format - the format for outputting this field (e.g. %03d for a LONG)
537
598
- distribution - the distribution of the samples (gaussian, monotonic_increasing, monotonic_decreasing; the default is normal)
538
599
- nullPercent - the percentage of nulls in this field
539
-
-blankPerent - the percentage of blanks in this field
600
+
- blankPercent - the percentage of blanks in this field
540
601
- values - for a STRING type, the possible set of values can be specified
541
602
542
603
## Merging Analyses ##
543
-
FTA supports merging of analyses run on distinct data shards. So for example, if part of the data to be profiled resides on one shard and the balance on a separate shard then FTA can be invoked on each shard separately and then merged. To accomplish this individual analyses should be executed (with similar configurations), the resulting serialized forms should then be deserialized on a common node and merged. Refer to the Merge example for further details.
604
+
FTA supports merging of analyses run on distinct data shards. For example, if part of the data to be profiled resides on one shard and the balance on a separate shard, then FTA can be invoked on each shard separately and the results merged. To accomplish this, individual analyses should be executed (with similar configurations); the resulting serialized forms should then be deserialized on a common node and merged. Refer to the Merge example for further details.
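
The general idea of combining per-shard summaries can be sketched with plain value-to-count maps. This is not the FTA merge API (see the Merge example for that); the class name and the use of a simple frequency map as the per-shard summary are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

final class ShardMergeSketch {
    // Combine two per-shard frequency maps by summing the counts of shared values.
    static Map<String, Long> merge(Map<String, Long> one, Map<String, Long> two) {
        Map<String, Long> merged = new HashMap<>(one);
        two.forEach((value, count) -> merged.merge(value, count, Long::sum));
        return merged;
    }
}
```

The cardinality of the merged map is at most the sum of the two shard cardinalities, which is why the shard cardinalities matter for the accuracy discussion.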
544
605
545
606
The accuracy of the merge is determined by the cardinality of the two individual shards, and falls into one of the following three cases:
546
607
- cardinality(one) + cardinality(two) < max cardinality