CSV.jl "corrupts" data when a field is very large #1171

@sylvaticus

I have a CSV file (the 43GB original version is from here, and here is a reduced 10MB, 10-line version produced with head -n10) where one field is a MULTIPOLYGON that can be very large. When I import it with CSV.jl (even the reduced version), I get a strange corruption:

julia> data    = CSV.read("MYRIAD-HES/test.csv",DataFrames.DataFrame;delim=',',quotechar='\"')
9×8 DataFrame
 Row │ Event    Hazard        code      starttime   endtime     Intensity  Unit          Geometry                          
     │ String7  String15      String15  Date        Date        Float64    String15      String                            
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ event0   heatwave      hw31      2004-01-04  2004-01-06  294.0      Kelvin        MULTIPOLYGON (((10.992 11.493, 1…
   2 │ event0   wildfire      wf481347  2004-01-06  2004-01-17    8.46256  Area          POLYGON ((11.161 10.799, 11.157 …
   3 │ event1   heatwave      hw18      2004-01-04  2004-01-06  295.0      Kelvin        POLYGON ((4.497 13.49, 4.497 12.…
   4 │ event1   wildfire      wf479859  2004-01-05  2004-01-17  104.733    Area          POLYGON ((4.388 13.44, 4.383 13.…
   5 │ event2   coldwave      cw56      2004-01-04  2004-01-11  228.0      Kelvin        MULTIPOLYGON (((-69.951 47.182, …
   6 │ event2   flood         fl0       2004-01-04  2004-01-16    1.0      DFO severity  LTIPOLYGON (((-91.514 34.569, -9…
   7 │ event3   flood         fl0       2004-01-04  2004-01-16    1.0      DFO severity  MULTIPOLYGON (((-91.514 34.569, …
   8 │ event3   extreme wind  ew15      2004-01-07  2004-01-07   19.0      m/s           POLYGON ((-83.442 44.935, -83.44…
   9 │ event4   flood         fl0       2004-01-04  2004-01-16    1.0      DFO severity  MULTIPOLYGON (((-91.514 34.569, …

julia> data[6,"Geometry"]
"LTIPOLYGON (((-91.514 34.569, -91.514 34.567, -91.516 34.567, -91.516 34.569, -91.514 34.569)), ((-91.518 34.569, -91.518 34.572, -91.516 34.572, -91.516 34.569, -91.518 34.569)), ((-91.505 34.576, -91.507 34.576, -91.507 34.578, -91.505 34.578, -91.505 34.576)), ((-91.502 34.581, -91.5 34.581, -91.5 34.578, -91.505 34.578, -91.505 34.583, -91.507 34.583, -91.507 34.585, -91.509 34.585, -91.509 34.587, -91.507 34.587, -91.507 34.59, -91.505 34.59, -91.505 34.587, -91.502 34.587, -91.502 34.581)), ((-91.493 34.635, -91.487 34.635, -91.487 34.632, -91.505 34.632, -91.505 34.635, -91.502 34.635, -91.502 34.637, -91.493 34.637, -91.493 34.635)), ((-91.498 34.655, -91.498 34.653, -91.5 34.653, -91.5 34.655, -91.498 34.655)), ((-91.502 34.659, -91.502 34.657, -91.505 34.657, -91.505 34.659, -91.502 34.659)), ((-91" ⋯ 429049 bytes ⋯ "8.488, -89.479 38.491, -89.474 38.491, -89.474 38.493, -89.477 38.493, -89.477 38.495, -89.472 38.495, -89.472 38.497, -89.474 38.497, -89.474 38.5, -89.468 38.5, -89.468 38.502, -89.461 38.502, -89.461 38.5, -89.454 38.5, -89.454 38.497, -89.45 38.497, -89.45 38.495, -89.461 38.495, -89.461 38.493, -89.463 38.493, -89.463 38.488, -89.477 38.488, -89.477 38.486, -89.47 38.486, -89.47 38.484, -89.477 38.484, -89.477 38.479, -89.479 38.479, -89.479 38.482)), ((-89.486 38.488, -89.486 38.491, -89.481 38.491, -89.481 38.488, -89.486 38.488)), ((-89.593 38.493, -89.596 38.493, -89.596 38.495, -89.593 38.495, -89.593 38.493)), ((-89.445 38.495, -89.447 38.495, -89.447 38.497, -89.445 38.497, -89.445 38.495)), ((-89.418 38.5, -89.418 38.497, -89.421 38.497, -89.421 38.5, -89.418 38.5)), ((-89.829 38.504, -89.834 3"

Note the "LTIPOLYGON" instead of "MULTIPOLYGON" in row 6.
I explored the input CSV file and it seems correct. Also, it works with CSVFiles.jl.
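
One way to double-check that the raw file itself is intact, independently of any CSV parser, is to scan it line by line with Base Julia only. The sketch below is a rough check, not part of CSV.jl: `suspicious_rows` is a hypothetical helper name, and it assumes one record per line and that the geometry is the first quoted field on each line.

```julia
# Hypothetical helper (not part of CSV.jl): scan the raw file and return the
# data-row numbers whose quoted geometry does not start with a known WKT keyword.
# Assumes one record per line and that the geometry is the first quoted field.
function suspicious_rows(path::AbstractString; keywords = ("MULTIPOLYGON", "POLYGON"))
    bad = Int[]
    for (lineno, line) in enumerate(eachline(path))
        lineno == 1 && continue                  # skip the header line
        q = findfirst(==('"'), line)             # opening quote of the geometry field
        q === nothing && continue                # no quoted field on this line
        field = SubString(line, nextind(line, q))
        # Record the row (header excluded) if no keyword matches the field start
        any(k -> startswith(field, k), keywords) || push!(bad, lineno - 1)
    end
    return bad
end

# Usage, with the path from the report:
# suspicious_rows("MYRIAD-HES/test.csv")
```

If this returns an empty vector for the file while `data[6,"Geometry"]` still starts with "LTIPOLYGON", that would suggest the truncation happens during parsing rather than in the file.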
