Open
Description
While working on large CSV aggregations (we're north of 1.8k input files at the moment, ending up grouped in the same destination dataset), it took me a bit of time initially to realize that while load_csv! will raise at the first opportunity if it meets a value not matching the provided dtype, calling cast will silently convert the field to nil and continue (which is also useful, of course).
Here is an up-to-date reproduction to showcase this:
Mix.install([
  {:explorer, "~> 0.11.1"}
])

ExUnit.start()

defmodule Repro do
  use ExUnit.Case
  alias Explorer.DataFrame, as: DF

  @incoming_data "field\nABC\n12.4"

  test "loading from CSV seems strict" do
    assert_raise RuntimeError, ~r/could not parse `ABC` as dtype `f64` at column 'field'/, fn ->
      DF.load_csv!(@incoming_data, dtypes: [{:field, {:f, 64}}])
    end
  end

  test "but casting from string, not strict" do
    result =
      @incoming_data
      |> DF.load_csv!(dtypes: [{:field, :string}])
      |> DF.mutate_with(fn df ->
        [field: Explorer.Series.cast(df[:field], {:f, 64})]
      end)
      |> Access.get(:field)
      |> Explorer.Series.to_list()

    # non-castable data has been translated to `nil`,
    # something which can catch you off guard quite a bit
    assert result == [nil, 12.4]
  end
end

Current notes from my exploration
- Polars has both strict & non-strict ways of doing things
- the Explorer code-base uses `strict_cast` in at least 2 places
- but it does not currently expose strictness as an option to the end user
- I could not find (so far) any mention of this behaviour (the silent conversion to `nil`) in the `cast` documentation
- my understanding is that exposing this could be a bit involved (not to mention defaulting to strict, if we wanted to)
I thought it would be useful to open a discussion on this, since it could very easily catch other people off guard (especially in Elixir, where things are usually stricter, and where more typing is being introduced).
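In the meantime, here is a minimal sketch of a user-land workaround, assuming the behaviour shown in the reproduction above (unparseable values become `nil`). The `StrictCast` module and its `cast!/2` function are hypothetical names of mine, not part of Explorer. The idea is to compare nil counts before and after the cast and raise if the cast introduced new ones.

```elixir
Mix.install([
  {:explorer, "~> 0.11.1"}
])

defmodule StrictCast do
  # Hypothetical helper, NOT part of Explorer: a user-land approximation
  # of a strict cast built on top of the non-strict `Explorer.Series.cast/2`.
  alias Explorer.Series

  def cast!(series, dtype) do
    casted = Series.cast(series, dtype)

    # Nils that already existed in the input are fine; only nils
    # introduced by the cast indicate values that could not be converted.
    nils_before = Series.nil_count(series)
    nils_after = Series.nil_count(casted)

    if nils_after > nils_before do
      # Keep only the input values that were non-nil before the cast
      # and nil afterwards, so the error message can show them.
      introduced = Series.and(Series.is_nil(casted), Series.is_not_nil(series))
      offending = series |> Series.mask(introduced) |> Series.to_list()

      raise ArgumentError,
            "could not cast #{inspect(offending)} to #{inspect(dtype)}"
    end

    casted
  end
end
```

With the data from the reproduction, `StrictCast.cast!(df[:field], {:f, 64})` would raise instead of returning a series containing `nil`. This costs an extra pass over the data, so it is a stopgap rather than a substitute for a strict option in Explorer itself.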