Commit f5d692c

- Minor change to Schema.infer()
1 parent 70b0908 commit f5d692c

4 files changed: +37 -10 lines changed

docs/creating-schemas.md

Lines changed: 4 additions & 3 deletions
@@ -93,7 +93,7 @@ You can infer a schema from
 - a URL List
 - mixed data types (CSV, JSON, etc.)
 
-If you have more than one CSV file, you can use `infer()` to check that all files have the same schema:
+If you have more than one CSV file, you can use `Schema.infer()` to check that all files have the same schema:
 
 ```java
 File testFile = getResourceFile("/testsuite-data/files/csv/1mb.csv");
@@ -103,9 +103,10 @@ List<File> fileList = Arrays.asList(testFile, testFile2);
 Schema schema = Schema.infer(fileList, StandardCharsets.UTF_8);
 ```
 
-If the CSV files have different headers, the `Schema.infer()` call will throw an Exception.
+If the CSV files have different headers, the `Schema.infer()` call will throw an Exception because there
+is no common schema that can be inferred from the files.
 
-In case you want to infer a schema and then use the data, it can be helpful to not use the static `Schema.infer()`
+In case you want to infer a schema from a file and then use the data, it can be helpful to not use the static `Schema.infer()`
 method, but first create a `Table` instance and then infer the schema from it.
 
 ```java

pom.xml

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@
     <modelVersion>4.0.0</modelVersion>
     <groupId>io.frictionlessdata</groupId>
     <artifactId>tableschema-java</artifactId>
-    <version>0.9.5-SNAPSHOT</version>
+    <version>0.9.6-SNAPSHOT</version>
     <packaging>jar</packaging>
     <issueManagement>
         <url>https://github.com/frictionlessdata/tableschema-java/issues</url>

src/main/java/io/frictionlessdata/tableschema/Table.java

Lines changed: 3 additions & 1 deletion
@@ -640,7 +640,7 @@ public void validate() throws TableValidationException, TableSchemaException {
      * for the Field in question. At the end, the best score so far is returned.
      *
      * This method iterates through the whole data set, which can be very costly for huge
-     * CSV/JSON files
+     * CSV/JSON files. In that case, use the {@link #inferSchema(int)} method to set a row limit
      *
      * For {@link BeanSchema}, the operation is much less costly, it is simply done via reflection
      * on the Bean class.
@@ -662,6 +662,8 @@ public Schema inferSchema() throws TypeInferringException{
      * For {@link BeanSchema}, the operation is simply done via reflection
      * on the Bean class, so the `rowLimit`does not have any effect.
      *
+     * @param rowLimit The max numer of rows to scan. Huge input files can take a considerable time to infer.
+     *
      * @return the created Schema
      *
      */

src/main/java/io/frictionlessdata/tableschema/schema/Schema.java

Lines changed: 29 additions & 5 deletions
@@ -237,21 +237,45 @@ public static Schema infer(List<Object[]> data, String[] headers, int rowLimit)
      * (direct data, files, or URLs), creating tables from each source, and then inferring
      * schemas from those tables. All inferred schemas must be equal, otherwise an exception
      * is thrown.
+     * This method can incur a significant performance penalty for large data sets, in that case use the
+     * overloaded method with a row limit.
      *
      * @param data Direct data source - can be a String containing table data or an ArrayNode
      * containing JSON representation of table data. May be null if using file or URL sources.
-     * @param charset The character encoding to use when reading from URLs. Used for URL streams only.
-     * @return The inferred Schema that is consistent across all provided data sources
+     * @param charset The character encoding to use when reading from URLs. Used for URL streams only.
+     *
+     * @return The inferred Schema that is consistent across all provided data sources
+     * @throws IllegalStateException if no valid data source is provided, if the data type is not supported,
+     * or if schemas inferred from different sources are not equal
+     * @throws RuntimeException if an IOException occurs while reading from files or URLs
+     */
+    public static Schema infer(Object data, Charset charset) {
+        return infer(data, charset, -1);
+    }
+
+    /**
+     * Infers a table schema from various data sources.
+     *
+     * This method attempts to infer a schema by reading data from one or more sources
+     * (direct data, files, or URLs), creating tables from each source, and then inferring
+     * schemas from those tables. All inferred schemas must be equal, otherwise an exception
+     * is thrown.
+     *
+     * @param data Direct data source - can be a String containing table data or an ArrayNode
+     * containing JSON representation of table data. May be null if using file or URL sources.
+     * @param charset The character encoding to use when reading from URLs. Used for URL streams only.
+     * @param rowLimit The max numer of rows to scan. Huge input files can take a considerable time to infer.
+     * @return The inferred Schema that is consistent across all provided data sources
      * @throws IllegalStateException if no valid data source is provided, if the data type is not supported,
      * or if schemas inferred from different sources are not equal
      * @throws RuntimeException if an IOException occurs while reading from files or URLs
      */
     public static Schema infer(
             Object data,
-            Charset charset) {
+            Charset charset,
+            int rowLimit) {
         List<File> paths = new ArrayList<>();
         List<URL> urls = new ArrayList<>();
-        // Infer schema from data source
         List<String> s = new ArrayList<>();
         if (data != null) {
             if (data instanceof String) {
@@ -349,7 +373,7 @@ public static Schema infer(
         for (String str : s) {
             Table table = Table.fromSource(str);
             String[] headers = table.getHeaders();
-            Schema schema = table.inferSchema(headers, -1);
+            Schema schema = table.inferSchema(headers, rowLimit);
             schemas.add(schema);
         }
         Schema lastSchema = null;
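
Note on the pattern in this change: the two-argument `infer` now delegates to the three-argument overload with `-1`, and every source must yield an equal schema. The following is a minimal, self-contained sketch of that shape, not the library's code. `RowLimitSketch`, `inferTypes`, and the two-type (integer/string) guess are hypothetical stand-ins. It also assumes that a negative `rowLimit` means "scan every row", which the commit implies but does not state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for illustration; not part of tableschema-java. */
final class RowLimitSketch {

    /** Overload without a limit: delegates with -1 (assumed to mean "scan every row"). */
    static List<String> inferTypes(List<List<String[]>> sources) {
        return inferTypes(sources, -1);
    }

    /** Infers one type name per column for each source; all sources must agree. */
    static List<String> inferTypes(List<List<String[]>> sources, int rowLimit) {
        if (sources.isEmpty()) {
            throw new IllegalStateException("No data source provided");
        }
        List<String> last = null;
        for (List<String[]> rows : sources) {
            List<String> current = inferOne(rows, rowLimit);
            if (last != null && !last.equals(current)) {
                throw new IllegalStateException(
                        "Inferred schemas differ: " + last + " vs " + current);
            }
            last = current;
        }
        return last;
    }

    /** Scans at most rowLimit rows (all rows if rowLimit is negative). */
    private static List<String> inferOne(List<String[]> rows, int rowLimit) {
        int limit = (rowLimit < 0) ? rows.size() : Math.min(rowLimit, rows.size());
        List<String> types = new ArrayList<>();
        for (int r = 0; r < limit; r++) {
            String[] row = rows.get(r);
            for (int c = 0; c < row.length; c++) {
                String guess = isInteger(row[c]) ? "integer" : "string";
                if (c >= types.size()) {
                    types.add(guess);
                } else if (!types.get(c).equals(guess)) {
                    types.set(c, "string"); // mixed values fall back to string
                }
            }
        }
        return types;
    }

    private static boolean isInteger(String value) {
        try {
            Long.parseLong(value.trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        List<String[]> first = Arrays.asList(
                new String[]{"1", "x"}, new String[]{"2", "y"}, new String[]{"n/a", "z"});
        List<String[]> second = Arrays.asList(
                new String[]{"7", "q"}, new String[]{"8", "r"});

        // Limited scan: only the first two rows of each source are inspected.
        System.out.println(inferTypes(Arrays.asList(first, second), 2));

        // Full scan: the third row of the first source changes its result.
        try {
            System.out.println(inferTypes(Arrays.asList(first, second)));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this toy data the limited scan never reaches the `n/a` row, so both sources agree. The full scan does reach it and the equality check fails. A row limit therefore trades accuracy for speed, which is worth keeping in mind when choosing a value for large files.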
