Releases: ddotta/parquetize
v0.5.8
v0.5.7
This release includes:
- bugfix by @leungi: remove single quotes in SQL statement that generates incorrect SQL syntax for connections of type Microsoft SQL Server #45
- {parquetize} now has a minimal required version (2.4.0) for the {haven} dependency package to ensure that conversions are performed correctly from SAS files compressed in BINARY mode #46
- csv_to_parquet now has a read_delim_args argument, allowing passing of arguments to read_delim (added by @nikostr) - see the sketch below
- table_to_parquet can now convert files with uppercase extensions (.SAS7BDAT, .SAV, .DTA)
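A minimal sketch of the new read_delim_args argument (the input file name and delimiter below are placeholders, and read_delim_args is assumed to take a named list forwarded to readr::read_delim()):

# Forward read_delim() options while converting a delimited text file
csv_to_parquet(
  path_to_file = "my_file.txt",                        # placeholder path
  path_to_parquet = tempdir(),
  read_delim_args = list(delim = ";", trim_ws = TRUE)  # forwarded to read_delim()
)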
What's Changed
- Add user_na argument in table_to_parquet function by @ddotta in #44
- fix: remove single quotes in SQL statement by @leungi in #45
- Specify minimal version for haven by @ddotta in #47
- Improves documentation for csv_to_parquet() for txt files by @ddotta in #48
- Adds argument read_delim_args to csv_to_parquet by @nikostr in #49
- table_to_parquet() can now convert files with uppercase extensions by @ddotta in #54
New Contributors
Full Changelog: v0.5.6.1...v0.5.7
v0.5.6.1
This release includes:
fst_to_parquet function
- a new fst_to_parquet function that converts an fst file to parquet format (see the sketch below)
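A minimal sketch of its use (the input path is a placeholder; the arguments are assumed to mirror the other *_to_parquet functions):

# Convert a local fst file to a parquet file written in a temporary directory
fst_to_parquet(
  path_to_file = "data/iris.fst",   # placeholder .fst file
  path_to_parquet = tempdir()
)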
Other
- Rely more on @inheritParams to simplify documentation of function arguments #38. This leads to some renaming of arguments (e.g. path_to_csv -> path_to_file...)
- Arguments compression and compression_level are now passed to the write_parquet_at_once and write_parquet_by_chunk functions and are now available in the main conversion functions of parquetize #36
- Group @importFrom directives in a single file to facilitate their maintenance #37
- Work on download_extract tests #43
Full Changelog: v0.5.6...v0.5.6.1
parquetize 0.5.6
This release includes:
Possibility to use an RDBMS as a source
You can convert to parquet the result of any query you want on any DBI-compatible RDBMS:
dbi_connection <- DBI::dbConnect(RSQLite::SQLite(),
  system.file("extdata", "iris.sqlite", package = "parquetize"))

# Reading iris table from local sqlite database
# and conversion to one parquet file:
dbi_to_parquet(
  conn = dbi_connection,
  sql_query = "SELECT * FROM iris",
  path_to_parquet = tempdir(),
  parquetname = "iris"
)
You can find more information in the dbi_to_parquet documentation.
check_parquet function
- a new check_parquet function that checks if a dataset/file is valid and returns its columns and arrow types (see the sketch below)
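A minimal sketch of its use (the path is a placeholder for a parquet file or dataset you have produced):

# Validate a parquet file and list its columns with their arrow types
check_parquet(file.path(tempdir(), "iris.parquet"))   # placeholder path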
Deprecations
Two arguments are deprecated to avoid confusion with arrow concepts and to keep consistency (see the sketch below):
- chunk_size is replaced by max_rows (chunk size is an arrow concept)
- chunk_memory_size is replaced by max_memory for consistency
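A minimal sketch of the renamed arguments with table_to_parquet (the value of max_rows is arbitrary):

# max_rows replaces chunk_size: write parquet files of at most 75 rows each
table_to_parquet(
  path_to_table = system.file("examples", "iris.sas7bdat", package = "haven"),
  path_to_parquet = tempdir(),
  max_rows = 75
)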
Other
- refactoring: extract the logic to write parquet files by chunk or at once into write_parquet_by_chunk and write_parquet_at_once
- a big test refactoring: all _to_parquet output files are formally validated (readable as parquet, number of lines, partitions, number of files)
- use cli_abort instead of cli_alert_danger with stop("") everywhere (see the sketch after this list)
- some minor changes
- bugfix: table_to_parquet did not select columns as expected
- bugfix: skip_if_offline for tests that download files
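For context, a sketch of the error-signalling pattern this item refers to (the message text is invented, not code from the package):

# Before: emit a formatted alert, then raise an empty error
cli::cli_alert_danger("the file doesn't exist")
stop("")

# After: a single call that both formats and raises the error
cli::cli_abort("the file doesn't exist")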
Thanks a lot 🙏 to @nbc for these new improvements 🚀
Full Changelog: v0.5.5...v0.5.6
parquetize 0.5.5
This release includes:
A very important new contributor to parquetize!
Due to these numerous contributions, @nbc is now officially one of the project authors!
Three argument deprecations
After a big refactoring, three arguments are deprecated:
- by_chunk: table_to_parquet will automatically chunk if you use one of chunk_memory_size or chunk_size.
- csv_as_a_zip: csv_to_parquet will detect if the file is a zip from its extension.
- url_to_csv: use path_to_csv instead, csv_to_parquet will detect if the file is remote from the file path.
They will raise a deprecation warning for the moment.
Chunking by memory size
The possibility to chunk parquet by memory size with table_to_parquet():
table_to_parquet() takes a chunk_memory_size argument to convert an input file into parquet files of roughly chunk_memory_size Mb each when the data are loaded in memory.
Argument by_chunk is deprecated (see above).
Example of use of the argument chunk_memory_size:
table_to_parquet(
  path_to_table = system.file("examples", "iris.sas7bdat", package = "haven"),
  path_to_parquet = tempdir(),
  chunk_memory_size = 5000  # this will create files of around 5 Gb when loaded in memory
)
Passing arguments like compression to write_parquet when chunking
Users can now pass arguments to write_parquet() when chunking (through the ellipsis). This can be used, for example, to pass compression and compression_level.
Example:
table_to_parquet(
  path_to_table = system.file("examples", "iris.sas7bdat", package = "haven"),
  path_to_parquet = tempdir(),
  compression = "zstd",
  compression_level = 10,
  chunk_memory_size = 5000
)
A new function download_extract
This function is added to ... download and unzip a file if needed.
file_path <- download_extract(
  "https://www.nomisweb.co.uk/output/census/2021/census2021-ts007.zip",
  filename_in_zip = "census2021-ts007-ctry.csv"
)

csv_to_parquet(
  file_path,
  path_to_parquet = tempdir()
)
Other
Under the hood, this release has hardened tests.
parquetize 0.5.4
parquetize 0.5.3
This release includes:
- Added column selection to the table_to_parquet() and csv_to_parquet() functions #20 (see the sketch below)
- The example files in parquet format of the iris table have been migrated to the inst/extdata directory.
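A minimal sketch of column selection (assuming the argument is named columns and takes a character vector of column names to keep; the names shown follow haven's iris.sas7bdat example file):

# Keep only two columns of the iris SAS file while converting to parquet
table_to_parquet(
  path_to_table = system.file("examples", "iris.sas7bdat", package = "haven"),
  path_to_parquet = tempdir(),
  columns = c("Species", "Petal_Length")
)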
parquetize 0.5.2
This release fixes the behaviour of the table_to_parquet() function when the argument by_chunk is TRUE.
parquetize 0.5.1
This release removes the duckdb_to_parquet() function on the advice of Brian Ripley from CRAN.
Indeed, the storage format of DuckDB is not yet stable. The storage will be stabilized when version 1.0 is released.
parquetize 0.5.0
This release corresponds to the one available on CRAN (0.5.0) 🎉