Series.cast/2 silently converts invalid values to nil #1136

@thbar

While working on large CSV aggregations (we're north of 1.8k input files at the moment, all grouped into the same destination dataset), it took me a while to realize that load_csv! and cast handle invalid values differently: load_csv! raises as soon as it meets a value that does not match the provided dtype, whereas cast silently converts the value to nil and continues (which is also useful, of course).

Here is an up-to-date reproduction to showcase this:

Mix.install([
  {:explorer, "~> 0.11.1"}
])

ExUnit.start()

defmodule Repro do
  use ExUnit.Case
  alias Explorer.DataFrame, as: DF

  @incoming_data "field\nABC\n12.4"

  test "loading from CSV seems strict" do
    assert_raise RuntimeError, ~r/could not parse `ABC` as dtype `f64` at column 'field'/, fn ->
      DF.load_csv!(@incoming_data, dtypes: [{:field, {:f, 64}}])
    end
  end

  test "but casting from string, not strict" do
    result =
      @incoming_data
      |> DF.load_csv!(dtypes: [{:field, :string}])
      |> DF.mutate_with(fn df ->
        [field: Explorer.Series.cast(df[:field], {:f, 64})]
      end)
      |> Access.get(:field)
      |> Explorer.Series.to_list()

    # non-castable data has been translated to `nil`,
    # something which can easily catch you off guard
    assert result == [nil, 12.4]
  end
end

Current notes from my exploration

  • Polars has both strict & non-strict ways of doing things
  • the Explorer code-base uses strict_cast in at least 2 places
  • but it does not currently expose strictness as an option to the end user
  • I could not find (so far) any mention of this silent behaviour in the cast documentation
  • my understanding is that exposing this could be a bit involved (not to mention defaulting to strict if we wanted to)
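
In the meantime, one possible userland workaround is to compare nil counts before and after the cast. This is only a minimal sketch, not something Explorer provides: `StrictCast.cast!/2` is a hypothetical helper name, and it assumes an eager series (inside `mutate_with` the series is lazy, so the comparison below would not work there).

```elixir
Mix.install([
  {:explorer, "~> 0.11.1"}
])

defmodule StrictCast do
  # Hypothetical helper, not part of Explorer.
  alias Explorer.Series

  # Casts an eager series and raises if the cast introduced new nils.
  def cast!(series, dtype) do
    casted = Series.cast(series, dtype)

    # nils already present in the input are legitimate; only count new ones
    introduced = Series.nil_count(casted) - Series.nil_count(series)

    if introduced > 0 do
      raise ArgumentError,
            "cast to #{inspect(dtype)} turned #{introduced} value(s) into nil"
    end

    casted
  end
end
```

With the data from the reproduction, `StrictCast.cast!(df[:field], {:f, 64})` would be expected to raise because `ABC` becomes nil, while a column containing only parseable values would be returned cast as usual.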

I thought it would be useful to open a discussion on this, since it could very well catch other people off guard (especially in Elixir, where things are usually stricter, and more typing is being introduced).
