README.md: 1 addition & 1 deletion
@@ -46,7 +46,7 @@ Contributions are very welcome, as are feature requests and suggestions. Please
There are several other packages for reading CSV files in Julia, which may suit your needs better:
* The standard library contains [DelimitedFiles.jl](https://docs.julialang.org/en/v1/stdlib/DelimitedFiles/), at least until Julia 1.8.
- This returns a `Matrix` rather than a [Tables.jl](https://github.com/JuliaData/Tables.jl)-style container, thus works best for files of homogenous element type.
+ This returns a `Matrix` rather than a [Tables.jl](https://github.com/JuliaData/Tables.jl)-style container, thus works best for files of homogeneous element type.
On large files, CSV.jl will be much faster. (A short comparison sketch follows this list.)
* [CSVFiles.jl](https://github.com/queryverse/CSVFiles.jl) uses [FileIO.jl](https://github.com/JuliaIO/FileIO.jl)'s `load` / `save` API,
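To make the trade-off concrete, here is a minimal sketch contrasting the two readers. It assumes a hypothetical `data.csv` with a header row; the file name and element type are illustrative, not prescribed by either package.

```julia
using DelimitedFiles, CSV

# DelimitedFiles: returns a plain Matrix (plus the header row, if requested),
# so it fits best when every cell shares one element type.
mat, header = readdlm("data.csv", ',', Float64; header=true)

# CSV.jl: returns a Tables.jl-compatible CSV.File with a type per column,
# so heterogeneous columns (strings, dates, missings) are handled naturally.
file = CSV.File("data.csv")
```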
docs/src/index.md: 2 additions & 2 deletions
@@ -22,8 +22,8 @@ to load the package.
To start out, let's discuss the high-level functionality provided by the package, which hopefully will help direct you to more specific documentation for your use case (short usage sketches follow the list):
* [`CSV.File`](@ref): the most commonly used function for ingesting delimited data; will read an entire data input or vector of data inputs, detecting the number of columns and rows, along with the type of data for each column. Returns a `CSV.File` object, which is like a lightweight table/DataFrame. Assuming `file` is a `CSV.File` object, individual columns can be accessed like `file.col1`, `file[:col1]`, or `file["col1"]`. You can see parsed column names via `file.names`. A `CSV.File` can also be iterated, where a `CSV.Row` is produced on each iteration, which allows access to each value in the row via `row.col1`, `row[:col1]`, or `row[1]`. You can also index a `CSV.File` directly, like `file[1]`, to return the entire `CSV.Row` at the provided index/row number. If the input is large enough, multiple threads will be used while parsing, and full column buffers will be allocated to hold the parsed data. `CSV.File` satisfies the [Tables.jl](https://github.com/JuliaData/Tables.jl) "source" interface, and so can be passed to valid sink functions like `DataFrame`, `SQLite.load!`, `Arrow.write`, etc. Supports a number of keyword arguments to control parsing, column types, and other file metadata options.
- *[`CSV.read`](@ref): a convenience function identical to `CSV.File`, but used when a `CSV.File` will be passed direclty to a sink function, like a `DataFrame`. In some cases, sinks may make copies of incoming data for their own safety; by calling `CSV.read(file, DataFrame)`, no copies of the parsed `CSV.File` will be made, and the `DataFrame` will take direct ownership of the `CSV.File`'s columns, which is more efficient than doing `CSV.File(file) |> DataFrame` which will result in an extra copy of each column being made. Keyword arguments are identical to `CSV.File`. Any valid Tables.jl sink function/table type can be passed as the 2nd argument. Like `CSV.File`, a vector of data inputs can be passed as the 1st argument, which will result in a single "long" table of all the inputs vertically concatenanted. Each input must have identical schemas (column names and types).
- * [`CSV.Rows`](@ref): an alternative approach for consuming delimited data, where the input is only consumed one row at a time, which allows "streaming" the data with a lower memory footrpint than `CSV.File`. Supports many of the same options as `CSV.File`, except column type handling is a little different. By default, every column type will be essentially `Union{Missing, String}`, i.e. no automatic type detection is done, but column types can be provided manually. Multithreading is not used while parsing. After constructing a `CSV.Rows` object, rows can be "streamed" by iterating, where each iteration produces a `CSV.Row2` object, which operates similar to `CSV.File`'s `CSV.Row` type where individual row values can be accessed via `row.col1`, `row[:col1]`, or `row[1]`. If each row is processed individually, additional memory can be saved by passing `reusebuffer=true`, which means a single buffer will be allocated to hold the values of only the currently iterated row. `CSV.Rows` also supports the Tables.jl interface and can also be passed to valid sink functions.
+ * [`CSV.read`](@ref): a convenience function identical to `CSV.File`, but used when a `CSV.File` will be passed directly to a sink function, like a `DataFrame`. In some cases, sinks may make copies of incoming data for their own safety; by calling `CSV.read(file, DataFrame)`, no copies of the parsed `CSV.File` will be made, and the `DataFrame` will take direct ownership of the `CSV.File`'s columns, which is more efficient than `CSV.File(file) |> DataFrame`, which results in an extra copy of each column being made. Keyword arguments are identical to `CSV.File`. Any valid Tables.jl sink function/table type can be passed as the 2nd argument. Like `CSV.File`, a vector of data inputs can be passed as the 1st argument, which will result in a single "long" table of all the inputs vertically concatenated. Each input must have identical schemas (column names and types).
+ * [`CSV.Rows`](@ref): an alternative approach for consuming delimited data, where the input is only consumed one row at a time, which allows "streaming" the data with a lower memory footprint than `CSV.File`. Supports many of the same options as `CSV.File`, except column type handling is a little different. By default, every column type will be essentially `Union{Missing, String}`, i.e. no automatic type detection is done, but column types can be provided manually. Multithreading is not used while parsing. After constructing a `CSV.Rows` object, rows can be "streamed" by iterating, where each iteration produces a `CSV.Row2` object, which operates similarly to `CSV.File`'s `CSV.Row` type, where individual row values can be accessed via `row.col1`, `row[:col1]`, or `row[1]`. If each row is processed individually, additional memory can be saved by passing `reusebuffer=true`, which means a single buffer will be allocated to hold the values of only the currently iterated row. `CSV.Rows` also supports the Tables.jl interface and can also be passed to valid sink functions.
* [`CSV.Chunks`](@ref): similar to `CSV.File`, but allows passing an `ntasks::Integer` keyword argument, which will cause the input file to be "chunked" up into `ntasks` number of chunks. After constructing a `CSV.Chunks` object, each iteration of the object will return a `CSV.File` of the next parsed chunk. Useful for processing extremely large files in "chunks". Because each iterated element is a valid Tables.jl "source", `CSV.Chunks` satisfies the `Tables.partitions` interface, so sinks that can process input partitions can operate by passing `CSV.Chunks` as the "source".
* [`CSV.write`](@ref): a valid Tables.jl "sink" function for writing any valid input table out in a delimited text format. Supports many options for controlling the output, like delimiter, quote characters, etc. Writes data to an internal buffer, which is flushed out when full; the buffer size is configurable. Also supports writing out partitioned inputs as separate output files, one file per input partition. To write out a `DataFrame`, for example, it's simply `CSV.write("data.csv", df)`; to write out a matrix, it's `using Tables; CSV.write("data.csv", Tables.table(mat))`.
* [`CSV.RowWriter`](@ref): an alternative way to produce csv output; takes any valid Tables.jl input, and on each iteration produces a single csv-formatted string for the next row of the input table.
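A minimal sketch of the `CSV.File` / `CSV.read` access patterns described above, using a hypothetical in-memory input (a filename string works the same way):

```julia
using CSV, DataFrames

data = "col1,col2\n1,a\n2,b\n"

file = CSV.File(IOBuffer(data))
file.names          # parsed column names: [:col1, :col2]
file.col1           # a whole column via property access
row = file[1]       # index directly to get a CSV.Row
row.col2            # value access within the row

# CSV.read parses and hands column ownership straight to the sink,
# avoiding the extra per-column copy of CSV.File(...) |> DataFrame:
df = CSV.read(IOBuffer(data), DataFrame)
```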
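Likewise for the streaming paths, a sketch under the same assumed in-memory input; note that with `reusebuffer=true`, row values must be consumed (or copied) before the next iteration:

```julia
using CSV

data = "a,b\n1,2\n3,4\n"

# CSV.Rows: one row at a time; values are essentially Union{Missing, String}
for row in CSV.Rows(IOBuffer(data); reusebuffer=true)
    @show row.a row.b
end

# CSV.Chunks: each iteration yields a CSV.File for the next parsed chunk
# (a tiny input like this may produce fewer than ntasks chunks)
for chunk in CSV.Chunks(IOBuffer(data); ntasks=2)
    # chunk is a Tables.jl source; process it, then let it be collected
end
```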
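And the output side, assuming a NamedTuple-of-vectors table (any Tables.jl source works) and illustrative output file names:

```julia
using CSV, Tables

tbl = (x = [1, 2, 3], y = ["a", "b", "c"])

CSV.write("data.csv", tbl)                      # any Tables.jl source
CSV.write("mat.csv", Tables.table(rand(3, 3)))  # a Matrix, wrapped as a table

# CSV.RowWriter: one csv-formatted string per iteration, header row first
for line in CSV.RowWriter(tbl)
    print(line)
end
```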
src/README.md: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ So the general strategy is to get the overall `CSV.Context` for a delimited file
CSV.jl provides a native integration with the [PooledArrays.jl](https://github.com/JuliaData/PooledArrays.jl/) package, which provides an array storage optimization by having a (hopefully) small pool of "expensive" (big or heap-allocated, or whatever) values, along with a memory-efficient integer array of "refs" where each ref maps to one of the values in the pool. This is sometimes referred to as a "dictionary encoding" in various data formats. As an example, if you have a column with 1,000,000 elements, but only 10 unique string values, you can have a `Vector{String}` pool to store the 10 unique strings and give each a unique `UInt32` value, and a `Vector{UInt32}` "ref array" for the million elements, where each element just indexes into the pool to get the actual string value.
- By providing the `pool` keyword argument, users can control how this optimization will be applied to individual columns, or to all columns of the delimted text being read.
+ By providing the `pool` keyword argument, users can control how this optimization will be applied to individual columns, or to all columns of the delimited text being read. (A short sketch follows the list below.)
Valid inputs for `pool` include:
* A `Bool`, `true` or `false`, which will apply to all string columns parsed; string columns either will _all_ be pooled, or _all_ not pooled
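A small sketch of the encoding described above, first building a pooled array by hand with PooledArrays, then asking CSV.jl to pool all string columns; the single-column input here is hypothetical:

```julia
using CSV, PooledArrays

# Dictionary encoding by hand: 10 unique strings repeated 1_000 times each
vals = repeat(["value$i" for i in 1:10], 1_000)
pa = PooledArray(vals)
eltype(pa.refs)   # compact integer ref type (UInt32 by default)
length(pa.pool)   # 10 unique values in the pool

# pool=true: every string column is pooled; pool=false: none are
file = CSV.File(IOBuffer("name\n" * join(vals, "\n")); pool=true)
```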