Commit b0fa943

ENH: Add support for biasing detection based on composite name (Table/File) (Issue #154) + ENH: Improve ZIP detection if have data of the form 99999- (i.e. a trailing hyphen)
1 parent e810bb0 commit b0fa943

File tree

80 files changed

+575
-465
lines changed

ChangeLog.md

Lines changed: 7 additions & 1 deletion
@@ -1,7 +1,13 @@
 
 ## Changes ##
 
-### 17.3.0
+### 17.4.0
+- INT: Bump testng to 7.12.0, logback-classic to 1.5.26
+- ENH: Improve ZIP detection if have data of the form 99999- (i.e. a trailing hyphen)
+- CLI: Fix bug when validating data based on a plugin
+- ENH: Add support for biasing detection based on composite name (Table/File) (Issue #154)
+
+### 17.3.1
 - INT: Bump logback-classic to 1.5.25, Jackson to 2.21.0
 - INT: Cleaning.
 - INT: More gradle 10.0 changes

README.md

Lines changed: 29 additions & 3 deletions
@@ -273,7 +273,7 @@ In addition to detecting a set of Base types FTA will also, when enabled (defaul
 like phone numbers, but that are in fact invalid, will NOT be detected as the Semantic Type TELEPHONE.
 * The set of Semantic Types detected is dependent on the current locale
 * The data stream name (e.g. the database field name or CSV field name) is commonly used to bias the detection. For example, if the locale language is English and the data stream matches the regular expression '.\*(?i)(surname|last.?name|lname|maiden.?name|name.?last|last_nm).\*|last' then the detection is more likely to declare this stream a NAME.LAST Semantic Type. The data stream name can also be used to negatively bias the detection. Consult the plugins.json file for more details.
-* Assuming the entire set of stream names is available, Semantic Type detection of a particular column may be impacted by other stream names, for example the Semantic Type PERSON.AGE is detected if we detect another field of type GENDER or NAME.FIRST.
+* Assuming the entire set of stream names is available, Semantic Type detection of a particular column may be impacted by other stream names, for example the Semantic Type PERSON.AGE is commonly detected if we detect another field of type GENDER or NAME.FIRST.
 * When using Record mode for Semantic Type analysis - the detection of Semantic Types for a stream may be impacted by prior determination of the Semantic Type of another Stream (either via detection or provided with the Context)
 * By default analysis is performed on the initial 4096 characters of the field (adjustable via setMaxInputLength()).
 * If two Semantic Types have equal confidence then the Semantic Type with the highest priority will be selected.
@@ -302,7 +302,7 @@ There are three basic types of plugins:
 * Finite (list) - captures any finite type (e.g. ISO-3166-2 (Country codes), US States, ...). Implemented via a supplied list with the valid elements enumerated.
 * Code (java) - captures any complex type (e.g. Even numbers, Credit Cards numbers). Implemented via a Java Class.
 
-Note: The Context (the current Stream Name and other field names) can be used to bias detection of the incoming data and/or solely determine the detection.
+Note: The Context (which includes the composite name (Table/File), the current stream names, and the other stream names) can be used to bias detection of the incoming data and/or solely determine the detection.
 
 ```json
 [
@@ -406,7 +406,7 @@ The optional 'minSamples' tag indicates that in order for this Semantic Type to
 
 The optional 'invalidList' tag is a list of invalid values for this Semantic Type, for example '[ "000-00-0000" ]' indicates that this is an invalid SSN, despite the fact that it matches the SSN regular expression.
 
-#### Example
+#### Examples
 The following example is looking for an Indian Postal Code. In this case the header is mandatory so we will insist on both detecting a regular expression of the form '\d{6}' and a case independent match for the header. The plugin will return '[1-9]\\d{5}' as it is illegal to have a leading zero for an Indian Postal Code.
 
 ```json
@@ -428,6 +428,28 @@ The following example is looking for an Indian Postal Code. In this case the he
 }
 ```
 
+The following example is looking for a Persons age. The plugin uses two possible regular expressions the first is an example of how to bias using a combination of the Composite name (Table or File) as well as the Stream name (Column or Field), the second is simply using the Stream name. If 'compositeKey' is set to true then the input provided to match will be the concatenation of the Composite name with a period with the Stream name, this is commonly useful when the input source is a Database. For example, if the Table name was 'Person' and the field name was 'Age' then the Composite key would be 'Person.Age' which matches the regular expression and hence we have a 100% confidence that this is a Person's age. The second regular expression only considers the Stream name and is only 90% confident that it is a Person's age. In this case the plugin insists on another field that indicates we have found a Person (e.g. Gender, FirstName) so as not to misdetect if the 'Age' was the age of a building/pet etc..
+
+```json
+{
+    "semanticType" : "PERSON.AGE",
+    "description": "Age (person)",
+    "pluginType": "java",
+    "clazz": "com.cobber.fta.plugins.person.Age",
+    "validLocales": [
+        {
+            "localeTag": "en",
+            "headerRegExps": [
+                { "regExp": "(?i)(Person|Employee|Client|Customer)\\.(age|age[_ ].*|.*[_ ]age)", "confidence": 100, "mandatory": true, "compositeKey": true },
+                { "regExp": "(?i)(age|age[_ ].*|.*[_ ]age)", "confidence": 90, "mandatory": true }
+            ]
+        }
+    ],
+    "baseType" : "LONG",
+    "priority": 98
+}
+```
+
 ### Finite plugins ###
 
 The mandatory 'content' element is required.
@@ -662,6 +684,10 @@ Just one test
 
 `$ ./gradlew types:test --tests TestDates.localeDateTest`
 
+Validate a set of samples (against a known plugin)
+
+`$ cli --pluginMode true --pluginName POSTAL_CODE.ZIP5_US --col 0 <file.csv>`
+
 ### Generate JavaDoc ###
 `$ ./gradlew javadoc`
 
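
Reviewer note: the two `headerRegExps` added to the README for PERSON.AGE can be exercised directly with `java.util.regex`. The sketch below is illustrative only. The class `AgeHeaderRegExpDemo` and its `confidence` helper are hypothetical names, not part of FTA, and the "first matching entry wins" ordering is an assumption based on the loop visible in `HeaderLocaleEntry.getHeaderConfidence`. It also assumes a non-null composite name.

```java
import java.util.regex.Pattern;

// Illustration only: both patterns are copied from the PERSON.AGE README example.
// The class and method names are hypothetical and do not exist in FTA.
final class AgeHeaderRegExpDemo {
    // Entry with "compositeKey": true, matched against <CompositeName>.<StreamName>
    static final Pattern COMPOSITE =
        Pattern.compile("(?i)(Person|Employee|Client|Customer)\\.(age|age[_ ].*|.*[_ ]age)");
    // Entry without compositeKey, matched against the stream name alone
    static final Pattern STREAM_ONLY =
        Pattern.compile("(?i)(age|age[_ ].*|.*[_ ]age)");

    // Returns the confidence of the first pattern that matches, or 0 if neither does.
    // Assumes compositeName is not null.
    static int confidence(final String compositeName, final String streamName) {
        if (COMPOSITE.matcher(compositeName + '.' + streamName).matches())
            return 100;
        if (STREAM_ONLY.matcher(streamName).matches())
            return 90;
        return 0;
    }

    public static void main(final String[] args) {
        System.out.println(confidence("Person", "Age"));      // composite key matches
        System.out.println(confidence("Building", "Age"));    // only the stream name matches
        System.out.println(confidence("Building", "Height")); // neither matches
    }
}
```

A header score of 90 for `Building.Age` is not the final answer: per the README text, the plugin still insists on another field that indicates a Person before declaring PERSON.AGE.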

core/src/main/java/com/cobber/fta/HeaderLocaleEntry.java

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@ public class HeaderLocaleEntry {
     public int getHeaderConfidence(final String input) {
         if (headerRegExps != null)
             for (final HeaderEntry headerEntry : headerRegExps) {
-                if (headerEntry.matches(input))
+                if (headerEntry.matches(null, input))
                     return headerEntry.confidence;
             }
 
3939

core/src/main/java/com/cobber/fta/core/HeaderEntry.java

Lines changed: 7 additions & 2 deletions
@@ -28,6 +28,8 @@ public class HeaderEntry {
     public int confidence;
     /** If true then the header must match be present. */
     public boolean mandatory;
+    /** If true then the match against &lt;CompositeName&gt;.&lt;DataStreamName&gt; as opposed to just &lt;DataStreamName&gt;. */
+    public boolean compositeKey;
 
     /** The pattern is used to cache the compiled regular expression since it will be executed many times. */
     private Pattern pattern;
@@ -62,13 +64,16 @@ public String toString() {
         return (new StringBuilder()).append('[').append(regExp).append(':').append(confidence).append(':').append(mandatory).append(']').toString();
     }
 
-    public boolean matches(final String input) {
+    public boolean matches(final String compositeName, final String dataStreamName) {
         if (pattern == null) {
             synchronized (this) {
                 pattern = Pattern.compile(regExp);
             }
         }
 
-        return pattern.matcher(input).matches();
+        if (!compositeKey || compositeName == null)
+            return pattern.matcher(dataStreamName).matches();
+
+        return pattern.matcher(compositeName + '.' + dataStreamName).matches();
     }
 }
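
Reviewer note: the new `matches(compositeName, dataStreamName)` falls back to the stream name whenever `compositeKey` is false or no composite name is supplied. The following is a minimal standalone sketch of that behaviour, not the FTA class itself. `HeaderEntrySketch` is a hypothetical name, and the lazy pattern cache uses a `volatile` field as a simplification of my own in place of the `synchronized` block in the diff.

```java
import java.util.regex.Pattern;

// Standalone sketch of the matching rule in HeaderEntry.matches().
// HeaderEntrySketch is a hypothetical class, not part of FTA.
final class HeaderEntrySketch {
    private final String regExp;
    private final boolean compositeKey;
    // Simplification (assumption): a volatile field replaces the synchronized block in the diff.
    private volatile Pattern pattern;

    HeaderEntrySketch(final String regExp, final boolean compositeKey) {
        this.regExp = regExp;
        this.compositeKey = compositeKey;
    }

    boolean matches(final String compositeName, final String dataStreamName) {
        Pattern p = pattern;
        if (p == null) {
            // Compile once and cache; a duplicate compile under contention is harmless.
            p = Pattern.compile(regExp);
            pattern = p;
        }
        // Match on "<composite>.<stream>" only when requested AND a composite name exists.
        final String input = (compositeKey && compositeName != null)
            ? compositeName + '.' + dataStreamName
            : dataStreamName;
        return p.matcher(input).matches();
    }

    public static void main(final String[] args) {
        final HeaderEntrySketch composite = new HeaderEntrySketch("(?i)person\\.age", true);
        final HeaderEntrySketch plain = new HeaderEntrySketch("(?i)age", false);
        System.out.println(composite.matches("Person", "Age")); // matched as "Person.Age"
        System.out.println(composite.matches(null, "Age"));     // falls back to "Age" alone
        System.out.println(plain.matches("Person", "Age"));     // composite name ignored
    }
}
```

One consequence worth noting: a composite-key pattern that requires the `Table.` prefix cannot match when the caller passes `null`, which is what `HeaderLocaleEntry.getHeaderConfidence(String)` and the deprecated `LogicalType.getHeaderConfidence(String)` now do.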

settings.gradle

Lines changed: 3 additions & 3 deletions
@@ -40,7 +40,7 @@ includeBuild 'examples/core/speed'
 dependencyResolutionManagement {
     versionCatalogs {
         libs {
-            version('fta', '17.3.1')
+            version('fta', '17.4.0')
             version('jacoco', '0.8.12')
 
             // https://mvnrepository.com/artifact/de.siegmar/fastcsv
@@ -70,13 +70,13 @@ dependencyResolutionManagement {
             // https://mvnrepository.com/artifact/com.google.guava/guava
             library('guava', 'com.google.guava:guava:33.5.0-jre')
             // https://mvnrepository.com/artifact/ch.qos.logback/logback-classic
-            library('logbackClassic', 'ch.qos.logback:logback-classic:1.5.25')
+            library('logbackClassic', 'ch.qos.logback:logback-classic:1.5.26')
             // https://mvnrepository.com/artifact/com.datadoghq/sketches-java
             library('sketches', 'com.datadoghq:sketches-java:0.8.3')
         }
         testLibs {
             // https://mvnrepository.com/artifact/org.testng/testng
-            library('testng', 'org.testng:testng:7.11.0')
+            library('testng', 'org.testng:testng:7.12.0')
         }
     }
 }

types/src/main/java/com/cobber/fta/CacheLRU.java

Lines changed: 1 addition & 1 deletion
@@ -34,7 +34,7 @@ public void put(final K key, final V value) {
         cache.put(key, value);
     }
 
-    public V get(K key) {
+    public V get(final K key) {
         return cache.getIfPresent(key);
     }
 
4040

types/src/main/java/com/cobber/fta/Facts.java

Lines changed: 3 additions & 3 deletions
@@ -371,23 +371,23 @@ public long getMinLong() {
         return minLong;
     }
 
-    public void setMinLong(long minLong) {
+    public void setMinLong(final long minLong) {
         this.minLong = minLong;
     }
 
     public long getMaxLong() {
         return maxLong;
     }
 
-    public void setMaxLong(long maxLong) {
+    public void setMaxLong(final long maxLong) {
         this.maxLong = maxLong;
     }
 
     public long getMinLongNonZero() {
         return minLongNonZero;
     }
 
-    public void setMinLongNonZero(long minLongNonZero) {
+    public void setMinLongNonZero(final long minLongNonZero) {
         this.minLongNonZero = minLongNonZero;
     }
 

types/src/main/java/com/cobber/fta/LogicalType.java

Lines changed: 26 additions & 1 deletion
@@ -80,11 +80,36 @@ public boolean initialize(final AnalysisConfig analysisConfig) throws FTAPluginE
     /**
      * Determine the confidence that the name of the data stream is likely a valid header for this Semantic Type.
      * Positive Numbers indicate it could be this Semantic Type, negative numbers indicate is it unlikely to be this Semantic Type, 0 indicates no opinion.
+     * @param context The context used to interpret the Data Stream (for example, stream name, date resolution mode, etc)
+     * @return An integer between -100 and 100 reflecting the confidence that this stream name is a valid header.
+     */
+    public int getHeaderConfidence(final AnalyzerContext context) {
+        return pluginLocaleEntry.getHeaderConfidence(context.getCompositeName(), context.getStreamName());
+    }
+
+    /**
+     * Determine the confidence that the name of the data stream is likely a valid header for this Semantic Type.
+     * Positive Numbers indicate it could be this Semantic Type, negative numbers indicate is it unlikely to be this Semantic Type, 0 indicates no opinion.
+     * @param compositeName The name of this composite (Table/File)
+     * @param dataStreamName The name of this data stream
+     * @return An integer between -100 and 100 reflecting the confidence that this stream name is a valid header.
+     */
+    public int getHeaderConfidence(final String compositeName, final String dataStreamName) {
+        return pluginLocaleEntry.getHeaderConfidence(compositeName, dataStreamName);
+    }
+
+    /**
+     * Determine the confidence that the name of the data stream is likely a valid header for this Semantic Type.
+     * Positive Numbers indicate it could be this Semantic Type, negative numbers indicate is it unlikely to be this Semantic Type, 0 indicates no opinion.
+     *
+     * @deprecated Replaced by {@link getHeaderConfidence(AnalyzerContext)}
+     *
      * @param dataStreamName The name of this data stream
      * @return An integer between -100 and 100 reflecting the confidence that this stream name is a valid header.
      */
+    @Deprecated
     public int getHeaderConfidence(final String dataStreamName) {
-        return pluginLocaleEntry.getHeaderConfidence(dataStreamName);
+        return pluginLocaleEntry.getHeaderConfidence(null, dataStreamName);
     }
 
     /**

types/src/main/java/com/cobber/fta/LogicalTypeBloomFilter.java

Lines changed: 2 additions & 2 deletions
@@ -104,7 +104,7 @@ private String backout() {
 
     @Override
     public PluginAnalysis analyzeSet(final AnalyzerContext context, final long matchCount, final long realSamples, final String currentRegExp, final Facts facts, final FiniteMap cardinality, final FiniteMap outliers, final TokenStreams tokenStreams, final AnalysisConfig analysisConfig) {
-        final int headerConfidence = getHeaderConfidence(context.getStreamName());
+        final int headerConfidence = getHeaderConfidence(context);
         if (headerConfidence <= 0 && cardinality.size() < 5)
             return new PluginAnalysis(backout());
 
@@ -119,7 +119,7 @@ public double getConfidence(final long matchCount, final long realSamples, final
         double confidence = (double)matchCount/realSamples;
 
         // Boost by up to 20% if we like the header
-        if (getHeaderConfidence(context.getStreamName()) > 0)
+        if (getHeaderConfidence(context) > 0)
             confidence = Math.min(confidence * 1.2, 1.0);
 
         return confidence;
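
Reviewer note: the boost in `getConfidence` is easy to misread as "add 20 points". It is a multiplicative factor of 1.2, capped at 1.0. A small worked sketch of the arithmetic follows. `HeaderBoostSketch` is a hypothetical class, and the header confidence is passed in as a plain integer here rather than derived from an `AnalyzerContext`.

```java
// Worked sketch of the header boost arithmetic seen in LogicalTypeBloomFilter.getConfidence().
// HeaderBoostSketch is a hypothetical class; the header confidence is supplied directly.
final class HeaderBoostSketch {
    static double confidence(final long matchCount, final long realSamples, final int headerConfidence) {
        // Base confidence is the fraction of real samples that matched.
        double confidence = (double) matchCount / realSamples;

        // A positive header confidence scales the result by 1.2, never exceeding 1.0.
        if (headerConfidence > 0)
            confidence = Math.min(confidence * 1.2, 1.0);

        return confidence;
    }

    public static void main(final String[] args) {
        System.out.println(confidence(80, 100, 0));   // no header opinion: base ratio only
        System.out.println(confidence(80, 100, 90));  // boosted: roughly 0.8 * 1.2
        System.out.println(confidence(90, 100, 100)); // boosted value would exceed 1.0, so it is capped
    }
}
```

With this commit the header confidence feeding that check can come from the composite key, so a table named `Person` can now trigger the boost where the stream name alone would not.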

types/src/main/java/com/cobber/fta/LogicalTypeFiniteSimple.java

Lines changed: 4 additions & 4 deletions
@@ -34,7 +34,7 @@ public abstract class LogicalTypeFiniteSimple extends LogicalTypeFinite {
     protected Reader reader;
     protected SingletonSet memberSet;
 
-    private static final CacheLRU<String, String> cache = new CacheLRU<>(30);
+    private static final CacheLRU<String, String> CACHE = new CacheLRU<>(30);
 
     public LogicalTypeFiniteSimple(final PluginDefinition plugin, final String backout, final int threshold) {
         super(plugin);
@@ -82,7 +82,7 @@ public boolean initialize(final AnalysisConfig analysisConfig) throws FTAPluginE
         // If the Regular Expression has not been set then generate one based on the content
         if (regExp == null) {
             final String cacheKey = semanticType + "___" + analysisConfig.getLocaleTag();
-            regExp = cache.get(cacheKey);
+            regExp = CACHE.get(cacheKey);
             if (regExp != null)
                 return true;
 
@@ -92,7 +92,7 @@ public boolean initialize(final AnalysisConfig analysisConfig) throws FTAPluginE
                 gen.train(elt);
 
             regExp = gen.getResult();
-            cache.put(cacheKey, regExp);
+            CACHE.put(cacheKey, regExp);
         }
 
         return true;
@@ -111,7 +111,7 @@ public String getRegExp() {
     @Override
     public PluginAnalysis analyzeSet(final AnalyzerContext context, final long matchCount, final long realSamples, final String currentRegExp,
             final Facts facts, final FiniteMap cardinality, final FiniteMap outliers, final TokenStreams tokenStreams, final AnalysisConfig analysisConfig) {
-        final int headerConfidence = getHeaderConfidence(context.getStreamName());
+        final int headerConfidence = getHeaderConfidence(context);
         final int baseOutliers = ((100 - getThreshold()) * getSize())/100;
 
         int maxOutliers = Math.max(1, baseOutliers / 2);
