Commit fd825c8

- static method in Schema to infer a schema from various data sources
- Bug fixes in schema inferral
1 parent 9fc3ded commit fd825c8

12 files changed, +581 -76 lines changed


.gitignore

Lines changed: 2 additions & 10 deletions
@@ -1,18 +1,10 @@
-MainTest.java
-nbproject
 
 # Compiled class file
 *.class
 
 # Log file
 *.log
 
-# BlueJ files
-*.ctxt
-
-# Mobile Tools for Java (J2ME)
-.mtj.tmp/
-
 # Package Files #
 *.jar
 *.war
@@ -33,7 +25,7 @@ hs_err_pid*
 .project
 .settings/
 
-
-
 /bin/
 /schema.json
+/test.csv
+/test.json

docs/creating-schemas.md

Lines changed: 36 additions & 5 deletions
@@ -2,7 +2,7 @@
 
 - [Creating from scratch via Java methods](#via-java-methods)
 - [Creating from a serialized JSON representation](#from-json)
-- [Creating from sample data (inferring)](#inferring-a-schema-from-data)
+- [Creating from sample data (inferring)](#auto-generating-a-schema-from-data-inferring)
 - [Schema validation](#schema-validation)
 - [Writing a Schema to a File](#writing-a-schema-to-a-file)
 
@@ -72,9 +72,41 @@ String schemaFilePath = "/path/to/schema/file/shema.json";
 Schema schema = new Schema(schemaFilePath, true); // enforce validation with strict=true.
 ```
 
-## Inferring a Schema from data
+## Auto-generating a Schema from data (inferring)
 
-If you don't have a schema for a CSV and don't want to manually define one then you can generate it:
+If you don't have a schema for a CSV and don't want to manually define one then you can auto-generate it:
+
+```java
+String csvData = "id,name,age\n1,John,30\n2,Jane,25\n3,Bob,35";
+Schema schema = Schema.infer(csvData, StandardCharsets.UTF_8);
+```
+The type inferral algorithm tries to cast to available types and each successful type casting increments a popularity score for the successful type cast in question. At the end, the best score so far is returned.
+The inferral algorithm traverses all of the table's rows and attempts to cast every single value of the table.
+
+You can infer a schema from
+- a CSV file
+- a URL pointing to a CSV file
+- a CSV containing String
+- a JSON array node
+- a String array containing multiple CSV data sets
+- a File List
+- a URL List
+- mixed data types (CSV, JSON, etc.)
+
+If you have more than one CSV file, you can use `infer()` to check that all files have the same schema:
+
+```java
+File testFile = getResourceFile("/testsuite-data/files/csv/1mb.csv");
+File testFile2 = getResourceFile("/testsuite-data/files/csv/10mb.csv");
+List<File> fileList = Arrays.asList(testFile, testFile2);
+
+Schema schema = Schema.infer(fileList, StandardCharsets.UTF_8);
+```
+
+If the CSV files have different headers, the `Schema.infer()` call will throw an Exception.
+
+In case you want to infer a schema and then use the data, it can be helpful to not use the static `Schema.infer()`
+method, but first create a `Table` instance and then infer the schema from it.
 
 ```java
 URL url = new URL("https://raw.githubusercontent.com/frictionlessdata/tableschema-java/master" +
@@ -88,8 +120,7 @@ System.out.println(schema.asJson());
 
 ```
 
-The type inferral algorithm tries to cast to available types and each successful type casting increments a popularity score for the successful type cast in question. At the end, the best score so far is returned.
-The inferral algorithm traverses all of the table's rows and attempts to cast every single value of the table. When dealing with large tables, you might want to limit the number of rows that the inferral algorithm processes:
+When dealing with large tables, you might want to limit the number of rows that the inferral algorithm processes:
 
 ```java
 // Only process the first 25 rows for type inferral.
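
Note on the docs change above: the added text describes type inference as a vote in which every successful cast raises a popularity score and the best-scoring type is returned. The following is a minimal sketch of that idea in plain Java, not the library's implementation. The class name `TypeVoteSketch`, the four candidate types, and the rule of crediting only the most specific successful cast per value are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch; none of these names come from tableschema-java.
class TypeVoteSketch {
    // Candidate types, ordered from most to least specific (an assumption).
    private static final Map<String, Predicate<String>> CASTERS = new LinkedHashMap<>();
    static {
        CASTERS.put("integer", v -> parses(() -> Long.parseLong(v.trim())));
        CASTERS.put("number", v -> parses(() -> Double.parseDouble(v.trim())));
        CASTERS.put("boolean", v -> v.trim().equalsIgnoreCase("true")
                || v.trim().equalsIgnoreCase("false"));
        CASTERS.put("string", v -> true); // every value can be kept as a string
    }

    // Returns true if the cast attempt completes without a NumberFormatException.
    private static boolean parses(Runnable cast) {
        try {
            cast.run();
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Scores one column: each value votes for the first type that accepts it.
    static String inferType(List<String> columnValues) {
        Map<String, Integer> scores = new LinkedHashMap<>();
        for (String value : columnValues) {
            for (Map.Entry<String, Predicate<String>> caster : CASTERS.entrySet()) {
                if (caster.getValue().test(value)) {
                    scores.merge(caster.getKey(), 1, Integer::sum);
                    break;
                }
            }
        }
        // Pick the highest score; fall back to "string" for an empty column.
        String best = "string";
        int bestScore = 0;
        for (Map.Entry<String, Integer> score : scores.entrySet()) {
            if (score.getValue() > bestScore) {
                best = score.getKey();
                bestScore = score.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(inferType(Arrays.asList("30", "25", "35")));     // integer
        System.out.println(inferType(Arrays.asList("John", "Jane", "Bob"))); // string
    }
}
```

Because every value in the column is visited, the cost grows with the table size, which is why the docs go on to mention limiting the number of rows processed.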

pom.xml

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@
     <modelVersion>4.0.0</modelVersion>
     <groupId>io.frictionlessdata</groupId>
     <artifactId>tableschema-java</artifactId>
-    <version>0.9.4-SNAPSHOT</version>
+    <version>0.9.5-SNAPSHOT</version>
     <packaging>jar</packaging>
     <issueManagement>
         <url>https://github.com/frictionlessdata/tableschema-java/issues</url>

src/main/java/io/frictionlessdata/tableschema/field/DateField.java

Lines changed: 6 additions & 2 deletions
@@ -52,8 +52,12 @@ public LocalDate parseValue(String value, String format, Map<String, Object> opt
                 would correspond to dates like: 30/11/14
             */
             String regex = parseDateFormat(format);
-            DateTimeFormatter formatter = DateTimeFormatter.ofPattern(regex);
-            return LocalDate.from(formatter.parse(value));
+            try {
+                DateTimeFormatter formatter = DateTimeFormatter.ofPattern(regex);
+                return LocalDate.from(formatter.parse(value));
+            } catch (Exception ex) {
+                throw new TypeInferringException("Invalid date format: " + format);
+            }
         }
         throw new TypeInferringException();
     }

src/main/java/io/frictionlessdata/tableschema/field/GeopointField.java

Lines changed: 68 additions & 41 deletions
@@ -31,64 +31,91 @@ public GeopointField(String name, String format, String title, String descriptio
     public double[] parseValue(String value, String format, Map<String, Object> options)
             throws TypeInferringException {
         try{
-            if(format.equalsIgnoreCase(Field.FIELD_FORMAT_DEFAULT)){
-                String[] geopoint = value.split(", *");
-
-                if(geopoint.length == 2){
-                    double lon = Double.parseDouble(geopoint[0]);
-                    double lat = Double.parseDouble(geopoint[1]);
-
-                    // No exception? It's a valid geopoint object.
-                    return new double[]{lon, lat};
+            if (format.equalsIgnoreCase(Field.FIELD_FORMAT_DEFAULT)){
+                return parseDefaultString(value);
+            } else if (format.equalsIgnoreCase(Field.FIELD_FORMAT_ARRAY)){
+                return parseArrayString(value);
+            } else if(format.equalsIgnoreCase(Field.FIELD_FORMAT_OBJECT)) {
+                return parseObjectString(value);
+            }
+        } catch(Exception e){
+            if (e instanceof TypeInferringException) {
+                throw e;
+            }
+            throw new TypeInferringException(e);
+        }
+        throw new TypeInferringException("Invalid format for geopoint field: " + format);
+    }
 
-                }else{
-                    throw new TypeInferringException("Geo points must have two coordinates");
+    @Override
+    public boolean isCompatibleValue(String value, String format) {
+        try {
+            parseDefaultString(value);
+            return true;
+        } catch (Exception ex) {
+            try {
+                parseArrayString(value);
+                return true;
+            } catch (Exception ex2) {
+                try {
+                    parseObjectString(value);
+                    return true;
+                } catch (Exception ex3) {
+                    return false;
                 }
+            }
+        }
+    }
 
-            }else if(format.equalsIgnoreCase(Field.FIELD_FORMAT_ARRAY)){
+    private static double[] parseDefaultString(String value) throws TypeInferringException {
+        String[] geopoint = value.split(", *");
 
-                // This will throw an exception if the value is not an array.
-                ArrayNode jsonArray = JsonUtil.getInstance().createArrayNode(value);
+        if(geopoint.length == 2){
+            double lon = Double.parseDouble(geopoint[0]);
+            double lat = Double.parseDouble(geopoint[1]);
 
-                if (jsonArray.size() == 2){
-                    double lon = jsonArray.get(0).asDouble();
-                    double lat = jsonArray.get(1).asDouble();
+            // No exception? It's a valid geopoint object.
+            return new double[]{lon, lat};
 
-                    // No exception? It's a valid geopoint object.
-                    return new double[]{lon, lat};
+        }else{
+            throw new TypeInferringException("Geo points must have two coordinates");
+        }
+    }
 
-                }else{
-                    throw new TypeInferringException("Geo points must have two coordinates");
-                }
+    private static double[] parseArrayString(String value) throws TypeInferringException {
+        // This will throw an exception if the value is not an array.
+        ArrayNode jsonArray = JsonUtil.getInstance().createArrayNode(value);
 
-            }else if(format.equalsIgnoreCase(Field.FIELD_FORMAT_OBJECT)){
+        if (jsonArray.size() == 2){
+            double lon = jsonArray.get(0).asDouble();
+            double lat = jsonArray.get(1).asDouble();
 
-                // This will throw an exception if the value is not an object.
-                JsonNode jsonObj = JsonUtil.getInstance().createNode(value);
+            // No exception? It's a valid geopoint object.
+            return new double[]{lon, lat};
 
-                if (jsonObj.size() == 2 && jsonObj.has("lon") && jsonObj.has("lat")){
-                    double lon = jsonObj.get("lon").asDouble();
-                    double lat = jsonObj.get("lat").asDouble();
+        }else{
+            throw new TypeInferringException("Geo points must have two coordinates");
+        }
+    }
 
-                    // No exception? It's a valid geopoint object.
-                    return new double[]{lon, lat};
+    private static double[] parseObjectString(String value) throws TypeInferringException {
+        // This will throw an exception if the value is not an object.
+        JsonNode jsonObj = JsonUtil.getInstance().createNode(value);
 
-                }else{
-                    throw new TypeInferringException();
-                }
+        if (jsonObj.size() == 2 && jsonObj.has("lon") && jsonObj.has("lat")){
+            double lon = jsonObj.get("lon").asDouble();
+            double lat = jsonObj.get("lat").asDouble();
 
-            }else{
-                throw new TypeInferringException();
-            }
+            // No exception? It's a valid geopoint object.
+            return new double[]{lon, lat};
 
-        }catch(Exception e){
-            if (e instanceof TypeInferringException) {
-                throw e;
-            }
-            throw new TypeInferringException(e);
+        }else{
+            throw new TypeInferringException();
         }
+
     }
 
+
     @Override
     public String formatValueAsString(double[] value, String format, Map<String, Object> options) throws InvalidCastException, ConstraintsException {
         if ((null == format) || (format.equalsIgnoreCase(Field.FIELD_FORMAT_DEFAULT))){
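
Note on `isCompatibleValue` above: the new method tries the default, array, and object parsers in turn and reports the value as compatible as soon as one of them succeeds. The same fallback can be written as a loop over a list of parsers instead of nested `try` blocks. The sketch below is a standalone illustration under stated assumptions, not a proposed patch: `GeopointFallbackSketch` is a hypothetical class, `parseArray` is a simplified bracket-stripping stand-in for the JSON-based `parseArrayString`, and the object format is left out so the example needs no JSON library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch; the real GeopointField parses arrays and objects as JSON.
class GeopointFallbackSketch {
    // Parsers are tried in this order (assumed to mirror the diff's order).
    private static final List<Function<String, double[]>> PARSERS = new ArrayList<>();
    static {
        PARSERS.add(GeopointFallbackSketch::parseDefault);
        PARSERS.add(GeopointFallbackSketch::parseArray);
    }

    // Default format: "lon, lat".
    static double[] parseDefault(String value) {
        String[] parts = value.split(", *");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Geo points must have two coordinates");
        }
        return new double[]{Double.parseDouble(parts[0]), Double.parseDouble(parts[1])};
    }

    // Simplified array format: "[lon, lat]", handled by stripping the brackets.
    static double[] parseArray(String value) {
        String trimmed = value.trim();
        if (!trimmed.startsWith("[") || !trimmed.endsWith("]")) {
            throw new IllegalArgumentException("Not an array: " + value);
        }
        return parseDefault(trimmed.substring(1, trimmed.length() - 1).trim());
    }

    // A value is compatible if at least one parser accepts it.
    static boolean isCompatibleValue(String value) {
        for (Function<String, double[]> parser : PARSERS) {
            try {
                parser.apply(value);
                return true;
            } catch (RuntimeException ex) {
                // This format did not match; fall through to the next parser.
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isCompatibleValue("12.5, 41.9"));   // true
        System.out.println(isCompatibleValue("[12.5, 41.9]")); // true
        System.out.println(isCompatibleValue("not a point"));  // false
    }
}
```

Adding a further format to this shape means appending one entry to `PARSERS` rather than adding another nesting level.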
