Commit e31cb71

Detect CSV/TSV column types by default, refs #679
1 parent 0bbc680 commit e31cb71

5 files changed

+106
-42
lines changed

docs/changelog.rst

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,7 @@ Unreleased
1111

1212
- The ``table.insert_all()`` and ``table.upsert_all()`` methods can now accept an iterator of lists or tuples as an alternative to dictionaries. The first item should be a list/tuple of column names. See :ref:`python_api_insert_lists` for details. (:issue:`672`)
1313
- **Breaking change:** The default floating point column type has been changed from ``FLOAT`` to ``REAL``, which is the correct SQLite type for floating point values. This affects auto-detected columns when inserting data. (:issue:`645`)
14+
- **Breaking change:** Type detection is now the default behavior for the ``insert`` and ``upsert`` CLI commands when importing CSV or TSV data. Previously all columns were treated as ``TEXT`` unless the ``--detect-types`` flag was passed. Use the new ``--no-detect-types`` flag to restore the old behavior. The ``SQLITE_UTILS_DETECT_TYPES`` environment variable has been removed. (:issue:`679`)
1415

1516
.. _v4_0a0:
1617

docs/cli-reference.rst

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -285,7 +285,8 @@ See :ref:`cli_inserting_data`, :ref:`cli_insert_csv_tsv`, :ref:`cli_insert_unstr
285285
--alter Alter existing table to add any missing columns
286286
--not-null TEXT Columns that should be created as NOT NULL
287287
--default <TEXT TEXT>... Default value that should be set for a column
288-
-d, --detect-types Detect types for columns in CSV/TSV data
288+
-d, --detect-types Detect types for columns in CSV/TSV data (default)
289+
--no-detect-types Treat all CSV/TSV columns as TEXT
289290
--analyze Run ANALYZE at the end of this operation
290291
--load-extension TEXT Path to SQLite extension, with optional :entrypoint
291292
--silent Do not show progress bar
@@ -342,7 +343,8 @@ See :ref:`cli_upsert`.
342343
--alter Alter existing table to add any missing columns
343344
--not-null TEXT Columns that should be created as NOT NULL
344345
--default <TEXT TEXT>... Default value that should be set for a column
345-
-d, --detect-types Detect types for columns in CSV/TSV data
346+
-d, --detect-types Detect types for columns in CSV/TSV data (default)
347+
--no-detect-types Treat all CSV/TSV columns as TEXT
346348
--analyze Run ANALYZE at the end of this operation
347349
--load-extension TEXT Path to SQLite extension, with optional :entrypoint
348350
--silent Do not show progress bar

docs/cli.rst

Lines changed: 6 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -508,7 +508,7 @@ Incoming CSV data will be assumed to use ``utf-8``. If your data uses a differen
508508
509509
If you are joining across multiple CSV files they must all use the same encoding.
510510

511-
Column types will be automatically detected in CSV or TSV data, using the same mechanism as ``--detect-types`` described in :ref:`cli_insert_csv_tsv`. You can pass the ``--no-detect-types`` option to disable this automatic type detection and treat all CSV and TSV columns as ``TEXT``.
511+
Column types will be automatically detected in CSV or TSV data, as described in :ref:`cli_insert_csv_tsv`. You can pass the ``--no-detect-types`` option to disable this automatic type detection and treat all CSV and TSV columns as ``TEXT``.
512512

513513
.. _cli_memory_explicit:
514514

@@ -1263,7 +1263,7 @@ To stop inserting after a specified number of records - useful for getting a fas
12631263
12641264
A progress bar is displayed when inserting data from a file. You can hide the progress bar using the ``--silent`` option.
12651265

1266-
By default every column inserted from a CSV or TSV file will be of type ``TEXT``. To automatically detect column types - resulting in a mix of ``TEXT``, ``INTEGER`` and ``REAL`` columns, use the ``--detect-types`` option (or its shortcut ``-d``).
1266+
By default, column types are automatically detected for CSV or TSV files - resulting in a mix of ``TEXT``, ``INTEGER`` and ``REAL`` columns. To disable type detection and treat all columns as ``TEXT``, use the ``--no-detect-types`` option.
12671267

12681268
For example, given a ``creatures.csv`` file containing this:
12691269

@@ -1277,9 +1277,9 @@ The following command:
12771277

12781278
.. code-block:: bash
12791279
1280-
sqlite-utils insert creatures.db creatures creatures.csv --csv --detect-types
1280+
sqlite-utils insert creatures.db creatures creatures.csv --csv
12811281
1282-
Will produce this schema:
1282+
Will produce this schema with automatically detected types:
12831283

12841284
.. code-block:: bash
12851285
@@ -1293,11 +1293,11 @@ Will produce this schema:
12931293
"weight" REAL
12941294
);
12951295
1296-
You can set the ``SQLITE_UTILS_DETECT_TYPES`` environment variable if you want ``--detect-types`` to be the default behavior:
1296+
To disable type detection and treat all columns as TEXT, use ``--no-detect-types``:
12971297

12981298
.. code-block:: bash
12991299
1300-
export SQLITE_UTILS_DETECT_TYPES=1
1300+
sqlite-utils insert creatures.db creatures creatures.csv --csv --no-detect-types
13011301
13021302
If a CSV or TSV file includes empty cells, like this one:
13031303
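
Reviewer note: the documentation change above describes columns being detected as a mix of ``TEXT``, ``INTEGER`` and ``REAL``. The following is a minimal stdlib-only sketch of that idea, not the library's ``TypeTracker`` implementation. The names ``classify`` and ``detect_column_types`` are hypothetical, and skipping empty cells is an assumption made for this illustration.

```python
import csv
import io


def classify(value):
    """Classify one CSV cell as INTEGER, REAL or TEXT.

    Uses Python's own int()/float() parsing, which is looser than a
    real implementation might want (this is only an illustration).
    """
    try:
        int(value)
        return "INTEGER"
    except ValueError:
        pass
    try:
        float(value)
        return "REAL"
    except ValueError:
        return "TEXT"


def detect_column_types(csv_text, detect_types=True):
    """Return a {column name: SQLite type} mapping for CSV text.

    With detect_types=False every column is TEXT, mirroring what the
    docs describe for --no-detect-types.
    """
    reader = csv.reader(io.StringIO(csv_text))
    headers = next(reader)
    if not detect_types:
        return {header: "TEXT" for header in headers}

    # Collect the set of cell classifications seen in each column
    seen = {header: set() for header in headers}
    for row in reader:
        for header, value in zip(headers, row):
            if value != "":  # assumption: empty cells do not influence the type
                seen[header].add(classify(value))

    types = {}
    for header in headers:
        kinds = seen[header]
        if kinds == {"INTEGER"}:
            types[header] = "INTEGER"
        elif kinds and kinds <= {"INTEGER", "REAL"}:
            types[header] = "REAL"
        else:
            types[header] = "TEXT"
    return types


data = "name,age,weight\nCleo,6,45.5\nDori,1,3.5"
print(detect_column_types(data))
print(detect_column_types(data, detect_types=False))
```

With the ``creatures`` data used in the tests, the sketch gives ``name`` as ``TEXT``, ``age`` as ``INTEGER`` and ``weight`` as ``REAL``. With ``detect_types=False``, every column is ``TEXT``.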

sqlite_utils/cli.py

Lines changed: 14 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -898,8 +898,12 @@ def inner(fn):
898898
"-d",
899899
"--detect-types",
900900
is_flag=True,
901-
envvar="SQLITE_UTILS_DETECT_TYPES",
902-
help="Detect types for columns in CSV/TSV data",
901+
help="Detect types for columns in CSV/TSV data (default)",
902+
),
903+
click.option(
904+
"--no-detect-types",
905+
is_flag=True,
906+
help="Treat all CSV/TSV columns as TEXT",
903907
),
904908
click.option(
905909
"--analyze",
@@ -951,6 +955,7 @@ def insert_upsert_implementation(
951955
not_null=None,
952956
default=None,
953957
detect_types=None,
958+
no_detect_types=False,
954959
analyze=False,
955960
load_extension=None,
956961
silent=False,
@@ -1019,7 +1024,8 @@ def insert_upsert_implementation(
10191024
)
10201025
else:
10211026
docs = (dict(zip(headers, row)) for row in reader)
1022-
if detect_types:
1027+
# detect_types is now the default, unless --no-detect-types is passed
1028+
if not no_detect_types:
10231029
tracker = TypeTracker()
10241030
docs = tracker.wrap(docs)
10251031
elif lines:
@@ -1191,6 +1197,7 @@ def insert(
11911197
stop_after,
11921198
alter,
11931199
detect_types,
1200+
no_detect_types,
11941201
analyze,
11951202
load_extension,
11961203
silent,
@@ -1273,6 +1280,7 @@ def insert(
12731280
replace=replace,
12741281
truncate=truncate,
12751282
detect_types=detect_types,
1283+
no_detect_types=no_detect_types,
12761284
analyze=analyze,
12771285
load_extension=load_extension,
12781286
silent=silent,
@@ -1311,6 +1319,7 @@ def upsert(
13111319
not_null,
13121320
default,
13131321
detect_types,
1322+
no_detect_types,
13141323
analyze,
13151324
load_extension,
13161325
silent,
@@ -1356,6 +1365,7 @@ def upsert(
13561365
not_null=not_null,
13571366
default=default,
13581367
detect_types=detect_types,
1368+
no_detect_types=no_detect_types,
13591369
analyze=analyze,
13601370
load_extension=load_extension,
13611371
silent=silent,
@@ -1443,6 +1453,7 @@ def bulk(
14431453
not_null=set(),
14441454
default={},
14451455
detect_types=False,
1456+
no_detect_types=True,
14461457
load_extension=load_extension,
14471458
silent=False,
14481459
bulk_sql=sql,
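
Reviewer note: as the hunks above read, the decision to wrap rows in ``TypeTracker`` now depends only on ``no_detect_types``. The ``detect_types`` value is still accepted and passed through, but the changed condition no longer consults it. The sketch below restates that condition as a standalone function. ``should_detect_types`` is a hypothetical name used for illustration and does not exist in the commit.

```python
def should_detect_types(detect_types=None, no_detect_types=False):
    """Mirror the condition `if not no_detect_types:` from the diff.

    detect_types is kept in the signature because -d/--detect-types is
    still a valid option, but it has no effect on the result.
    """
    return not no_detect_types


# Default invocation: detection is on
print(should_detect_types())
# Explicit -d / --detect-types: same result as the default
print(should_detect_types(detect_types=True))
# --no-detect-types: detection is off
print(should_detect_types(no_detect_types=True))
# Both flags together: --no-detect-types wins under this condition
print(should_detect_types(detect_types=True, no_detect_types=True))
# The values bulk() passes in the diff: detection stays off
print(should_detect_types(detect_types=False, no_detect_types=True))
```

One consequence worth noting in review: passing both ``--detect-types`` and ``--no-detect-types`` is not rejected in the hunks shown, and under this condition the result is all ``TEXT`` columns.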

tests/test_cli.py

Lines changed: 81 additions & 31 deletions
Original file line numberDiff line numberDiff line change
@@ -1907,7 +1907,7 @@ def test_insert_encoding(tmpdir):
19071907
# Using --encoding=latin-1 should work
19081908
good_result = CliRunner().invoke(
19091909
cli.cli,
1910-
["insert", db_path, "places", csv_path, "--encoding", "latin-1", "--csv"],
1910+
["insert", db_path, "places", csv_path, "--encoding", "latin-1", "--csv", "--no-detect-types"],
19111911
catch_exceptions=False,
19121912
)
19131913
assert good_result.exit_code == 0
@@ -2245,13 +2245,13 @@ def test_csv_insert_bom(tmpdir):
22452245
fp.write(b"\xef\xbb\xbfname,age\nCleo,5")
22462246
result = CliRunner().invoke(
22472247
cli.cli,
2248-
["insert", db_path, "broken", bom_csv_path, "--encoding", "utf-8", "--csv"],
2248+
["insert", db_path, "broken", bom_csv_path, "--encoding", "utf-8", "--csv", "--no-detect-types"],
22492249
catch_exceptions=False,
22502250
)
22512251
assert result.exit_code == 0
22522252
result2 = CliRunner().invoke(
22532253
cli.cli,
2254-
["insert", db_path, "fixed", bom_csv_path, "--csv"],
2254+
["insert", db_path, "fixed", bom_csv_path, "--csv", "--no-detect-types"],
22552255
catch_exceptions=False,
22562256
)
22572257
assert result2.exit_code == 0
@@ -2263,43 +2263,40 @@ def test_csv_insert_bom(tmpdir):
22632263
]
22642264

22652265

2266-
@pytest.mark.parametrize("option_or_env_var", (None, "-d", "--detect-types"))
2267-
def test_insert_detect_types(tmpdir, option_or_env_var):
2266+
@pytest.mark.parametrize("option", (None, "-d", "--detect-types"))
2267+
def test_insert_detect_types(tmpdir, option):
2268+
"""Test that type detection is now the default behavior"""
22682269
db_path = str(tmpdir / "test.db")
22692270
data = "name,age,weight\nCleo,6,45.5\nDori,1,3.5"
22702271
extra = []
2271-
if option_or_env_var:
2272-
extra = [option_or_env_var]
2272+
if option:
2273+
extra = [option]
22732274

2274-
def _test():
2275-
result = CliRunner().invoke(
2276-
cli.cli,
2277-
["insert", db_path, "creatures", "-", "--csv"] + extra,
2278-
catch_exceptions=False,
2279-
input=data,
2280-
)
2281-
assert result.exit_code == 0
2282-
db = Database(db_path)
2283-
assert list(db["creatures"].rows) == [
2284-
{"name": "Cleo", "age": 6, "weight": 45.5},
2285-
{"name": "Dori", "age": 1, "weight": 3.5},
2286-
]
2287-
2288-
if option_or_env_var is None:
2289-
# Use environment variable instead of option
2290-
with mock.patch.dict(os.environ, {"SQLITE_UTILS_DETECT_TYPES": "1"}):
2291-
_test()
2292-
else:
2293-
_test()
2275+
result = CliRunner().invoke(
2276+
cli.cli,
2277+
["insert", db_path, "creatures", "-", "--csv"] + extra,
2278+
catch_exceptions=False,
2279+
input=data,
2280+
)
2281+
assert result.exit_code == 0
2282+
db = Database(db_path)
2283+
assert list(db["creatures"].rows) == [
2284+
{"name": "Cleo", "age": 6, "weight": 45.5},
2285+
{"name": "Dori", "age": 1, "weight": 3.5},
2286+
]
22942287

22952288

2296-
@pytest.mark.parametrize("option", ("-d", "--detect-types"))
2289+
@pytest.mark.parametrize("option", (None, "-d", "--detect-types"))
22972290
def test_upsert_detect_types(tmpdir, option):
2291+
"""Test that type detection is now the default behavior for upsert"""
22982292
db_path = str(tmpdir / "test.db")
22992293
data = "id,name,age,weight\n1,Cleo,6,45.5\n2,Dori,1,3.5"
2294+
extra = []
2295+
if option:
2296+
extra = [option]
23002297
result = CliRunner().invoke(
23012298
cli.cli,
2302-
["upsert", db_path, "creatures", "-", "--csv", "--pk", "id"] + [option],
2299+
["upsert", db_path, "creatures", "-", "--csv", "--pk", "id"] + extra,
23032300
catch_exceptions=False,
23042301
input=data,
23052302
)
@@ -2312,12 +2309,12 @@ def test_upsert_detect_types(tmpdir, option):
23122309

23132310

23142311
def test_csv_detect_types_creates_real_columns(tmpdir):
2315-
"""Test that CSV import with --detect-types creates REAL columns for floats"""
2312+
"""Test that CSV import creates REAL columns for floats (default behavior)"""
23162313
db_path = str(tmpdir / "test.db")
23172314
data = "name,age,weight\nCleo,6,45.5\nDori,1,3.5"
23182315
result = CliRunner().invoke(
23192316
cli.cli,
2320-
["insert", db_path, "creatures", "-", "--csv", "--detect-types"],
2317+
["insert", db_path, "creatures", "-", "--csv"],
23212318
catch_exceptions=False,
23222319
input=data,
23232320
)
@@ -2333,6 +2330,59 @@ def test_csv_detect_types_creates_real_columns(tmpdir):
23332330
)
23342331

23352332

2333+
def test_insert_no_detect_types(tmpdir):
2334+
"""Test that --no-detect-types treats all columns as TEXT"""
2335+
db_path = str(tmpdir / "test.db")
2336+
data = "name,age,weight\nCleo,6,45.5\nDori,1,3.5"
2337+
result = CliRunner().invoke(
2338+
cli.cli,
2339+
["insert", db_path, "creatures", "-", "--csv", "--no-detect-types"],
2340+
catch_exceptions=False,
2341+
input=data,
2342+
)
2343+
assert result.exit_code == 0
2344+
db = Database(db_path)
2345+
# All columns should be TEXT when --no-detect-types is used
2346+
assert list(db["creatures"].rows) == [
2347+
{"name": "Cleo", "age": "6", "weight": "45.5"},
2348+
{"name": "Dori", "age": "1", "weight": "3.5"},
2349+
]
2350+
assert db["creatures"].schema == (
2351+
'CREATE TABLE "creatures" (\n'
2352+
' "name" TEXT,\n'
2353+
' "age" TEXT,\n'
2354+
' "weight" TEXT\n'
2355+
")"
2356+
)
2357+
2358+
2359+
def test_upsert_no_detect_types(tmpdir):
2360+
"""Test that --no-detect-types treats all columns as TEXT for upsert"""
2361+
db_path = str(tmpdir / "test.db")
2362+
data = "id,name,age,weight\n1,Cleo,6,45.5\n2,Dori,1,3.5"
2363+
result = CliRunner().invoke(
2364+
cli.cli,
2365+
["upsert", db_path, "creatures", "-", "--csv", "--pk", "id", "--no-detect-types"],
2366+
catch_exceptions=False,
2367+
input=data,
2368+
)
2369+
assert result.exit_code == 0
2370+
db = Database(db_path)
2371+
# All columns should be TEXT when --no-detect-types is used
2372+
assert list(db["creatures"].rows) == [
2373+
{"id": "1", "name": "Cleo", "age": "6", "weight": "45.5"},
2374+
{"id": "2", "name": "Dori", "age": "1", "weight": "3.5"},
2375+
]
2376+
assert db["creatures"].schema == (
2377+
'CREATE TABLE "creatures" (\n'
2378+
' "id" TEXT PRIMARY KEY,\n'
2379+
' "name" TEXT,\n'
2380+
' "age" TEXT,\n'
2381+
' "weight" TEXT\n'
2382+
")"
2383+
)
2384+
2385+
23362386
def test_integer_overflow_error(tmpdir):
23372387
db_path = str(tmpdir / "test.db")
23382388
result = CliRunner().invoke(
