ENH: Add support for biasing detection based on composite name (Table/File) (Issue #154) + ENH: Improve ZIP detection if have data of the form 99999- (i.e. a trailing hyphen)
README.md (29 additions & 3 deletions)
@@ -273,7 +273,7 @@ In addition to detecting a set of Base types FTA will also, when enabled (defaul
 like phone numbers, but that are in fact invalid, will NOT be detected as the Semantic Type TELEPHONE.
 * The set of Semantic Types detected is dependent on the current locale
 * The data stream name (e.g. the database field name or CSV field name) is commonly used to bias the detection. For example, if the locale language is English and the data stream matches the regular expression '.\*(?i)(surname|last.?name|lname|maiden.?name|name.?last|last_nm).\*|last' then the detection is more likely to declare this stream a NAME.LAST Semantic Type. The data stream name can also be used to negatively bias the detection. Consult the plugins.json file for more details.
-* Assuming the entire set of stream names is available, Semantic Type detection of a particular column may be impacted by other stream names, for example the Semantic Type PERSON.AGE is detected if we detect another field of type GENDER or NAME.FIRST.
+* Assuming the entire set of stream names is available, Semantic Type detection of a particular column may be impacted by other stream names, for example the Semantic Type PERSON.AGE is commonly detected if we detect another field of type GENDER or NAME.FIRST.
 * When using Record mode for Semantic Type analysis - the detection of Semantic Types for a stream may be impacted by prior determination of the Semantic Type of another Stream (either via detection or provided with the Context)
 * By default analysis is performed on the initial 4096 characters of the field (adjustable via setMaxInputLength()).
 * If two Semantic Types have equal confidence then the Semantic Type with the highest priority will be selected.
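For readers of this hunk, the stream-name biasing described in the bullets above can be illustrated with a minimal sketch. It uses only `java.util.regex` and the regular expression quoted in the README (with the markdown escapes removed). The class `StreamNameBias` and the method `biasesTowardLastName` are hypothetical names chosen for illustration and are not part of FTA's API; how FTA actually weighs a header match is defined in plugins.json and is not modelled here.

```java
import java.util.regex.Pattern;

// Illustrative only: not FTA code. Shows which stream names the quoted
// NAME.LAST header expression would accept.
class StreamNameBias {
    // Expression quoted in the README, with the markdown '\*' escapes removed.
    static final Pattern LAST_NAME_HEADER = Pattern.compile(
        ".*(?i)(surname|last.?name|lname|maiden.?name|name.?last|last_nm).*|last");

    // Hypothetical helper: true if the whole stream name matches the header expression.
    static boolean biasesTowardLastName(String streamName) {
        return LAST_NAME_HEADER.matcher(streamName).matches();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"Surname", "LAST_NAME", "city"}) {
            System.out.println(name + " -> " + biasesTowardLastName(name));
        }
    }
}
```

A match here would only make the NAME.LAST determination more likely; the sampled data still has to be consistent with the Semantic Type.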
@@ -302,7 +302,7 @@ There are three basic types of plugins:
 * Finite (list) - captures any finite type (e.g. ISO-3166-2 (Country codes), US States, ...). Implemented via a supplied list with the valid elements enumerated.
 * Code (java) - captures any complex type (e.g. Even numbers, Credit Cards numbers). Implemented via a Java Class.

-Note: The Context (the current Stream Name and other field names) can be used to bias detection of the incoming data and/or solely determine the detection.
+Note: The Context (which includes the composite name (Table/File), the current stream name, and the other stream names) can be used to bias detection of the incoming data and/or solely determine the detection.

 ```json
 [
@@ -406,7 +406,7 @@ The optional 'minSamples' tag indicates that in order for this Semantic Type to

 The optional 'invalidList' tag is a list of invalid values for this Semantic Type, for example '[ "000-00-0000" ]' indicates that this is an invalid SSN, despite the fact that it matches the SSN regular expression.

-#### Example
+#### Examples
 The following example is looking for an Indian Postal Code. In this case the header is mandatory so we will insist on both detecting a regular expression of the form '\d{6}' and a case independent match for the header. The plugin will return '[1-9]\\d{5}' as it is illegal to have a leading zero for an Indian Postal Code.

 ```json
@@ -428,6 +428,28 @@ The following example is looking for an Indian Postal Code. In this case the he
 }
 ```

+The following example is looking for a Person's age. The plugin uses two possible regular expressions: the first is an example of how to bias using a combination of the Composite name (Table or File) and the Stream name (Column or Field); the second uses only the Stream name. If 'compositeKey' is set to true then the input provided to match will be the Composite name and the Stream name joined by a period, which is commonly useful when the input source is a Database. For example, if the Table name was 'Person' and the field name was 'Age' then the Composite key would be 'Person.Age', which matches the regular expression, and hence we have 100% confidence that this is a Person's age. The second regular expression only considers the Stream name and is only 90% confident that it is a Person's age. In this case the plugin insists on another field that indicates we have found a Person (e.g. Gender, FirstName) so as not to misdetect if the 'Age' was the age of a building, pet, etc.
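The JSON for this example is cut off in the diff view, so the following is only a rough sketch of the composite-key matching the paragraph describes. The two patterns, the class `CompositeKeyBias`, and its methods are all assumptions made for illustration; the real expressions live in the plugin definition and may differ. The sketch also leaves out the plugin's extra requirement that another field (e.g. Gender, FirstName) indicates a Person.

```java
import java.util.regex.Pattern;

// Illustrative only: not FTA code. Models how a 'compositeKey' header match
// could take precedence over a plain stream-name match.
class CompositeKeyBias {
    // Hypothetical patterns standing in for the ones in the truncated JSON.
    static final Pattern COMPOSITE_PATTERN = Pattern.compile("(?i)person\\.age");
    static final Pattern STREAM_PATTERN = Pattern.compile("(?i)age");

    // With compositeKey set, the match input is "<Composite name>.<Stream name>".
    static String matchInput(String compositeName, String streamName, boolean compositeKey) {
        return compositeKey ? compositeName + "." + streamName : streamName;
    }

    // Returns the header confidence from the README text: 100 for a composite-key
    // match, 90 for a stream-name-only match, 0 otherwise.
    static int headerConfidence(String compositeName, String streamName) {
        if (COMPOSITE_PATTERN.matcher(matchInput(compositeName, streamName, true)).matches()) {
            return 100;
        }
        if (STREAM_PATTERN.matcher(matchInput(compositeName, streamName, false)).matches()) {
            return 90;
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(headerConfidence("Person", "Age"));
        System.out.println(headerConfidence("Building", "Age"));
    }
}
```

Under these assumed patterns, a table named 'Building' with a column 'Age' falls through to the weaker stream-name match, which is the situation where the requirement for a supporting Person field matters.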