Commit 482a187

Improve performance of reading files with duplicate column names (#955)
* Add tests for duplicate column name handling

  In the next commit, we'll be changing the code responsible for naming duplicate columns, and these tests should ensure that the behavior doesn't change.

* Improve performance of reading files with duplicate column names

  I need to load a file with 30k columns, 10k of which have the same name. Currently, this is practically impossible because makeunique(), which produces unique column names, has cubic complexity. This commit changes the algorithm to use a Dict to quickly look up the existence of columns and to cache the next numeric suffix used to uniquify column names. Care has been taken to ensure that columns are named the same way as before. To that end, additional tests were added in the previous commit.
1 parent d25992a commit 482a187

2 files changed: +16 -3 lines changed

src/utils.jl

Lines changed: 6 additions & 3 deletions

@@ -349,17 +349,20 @@ function makeunique(names)
     set = Set(names)
     length(set) == length(names) && return Symbol[Symbol(x) for x in names]
     nms = Symbol[]
+    nextsuffix = Dict{eltype(names), UInt}()
     for nm in names
-        if nm in nms
-            k = 1
+        if haskey(nextsuffix, nm)
+            k = nextsuffix[nm]
             newnm = Symbol("$(nm)_$k")
-            while newnm in set || newnm in nms
+            while newnm in set || haskey(nextsuffix, newnm)
                 k += 1
                 newnm = Symbol("$(nm)_$k")
             end
+            nextsuffix[nm] = k + 1
             nm = newnm
         end
         push!(nms, nm)
+        nextsuffix[nm] = 1
     end
     @assert length(names) == length(nms)
     return nms
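
To try the change outside the package, here is a minimal standalone sketch of the function before and after this commit. It assumes the input is a `Vector{Symbol}`. The names `makeunique_old` and `makeunique_new` are hypothetical labels used only for this illustration; in the package both versions are called `makeunique`.

```julia
# Standalone sketch, assuming Symbol column names.
# makeunique_old / makeunique_new are hypothetical names for this example.

# Before the commit: every membership check scans the `nms` vector,
# and the suffix search restarts at 1 for every duplicate.
function makeunique_old(names::Vector{Symbol})
    set = Set(names)
    length(set) == length(names) && return copy(names)
    nms = Symbol[]
    for nm in names
        if nm in nms                          # linear scan of names seen so far
            k = 1
            newnm = Symbol("$(nm)_$k")
            while newnm in set || newnm in nms
                k += 1
                newnm = Symbol("$(nm)_$k")
            end
            nm = newnm
        end
        push!(nms, nm)
    end
    return nms
end

# After the commit: a Dict records which names are already taken and,
# for each of them, the next numeric suffix worth trying.
function makeunique_new(names::Vector{Symbol})
    set = Set(names)
    length(set) == length(names) && return copy(names)
    nms = Symbol[]
    nextsuffix = Dict{Symbol, UInt}()
    for nm in names
        if haskey(nextsuffix, nm)             # hash lookup instead of a scan
            k = nextsuffix[nm]                # resume where the last search stopped
            newnm = Symbol("$(nm)_$k")
            while newnm in set || haskey(nextsuffix, newnm)
                k += 1
                newnm = Symbol("$(nm)_$k")
            end
            nextsuffix[nm] = k + 1
            nm = newnm
        end
        push!(nms, nm)
        nextsuffix[nm] = 1                    # mark the emitted name as taken
    end
    return nms
end
```

Calling both functions on the same input should give identical results. For example, `makeunique_new([:a, :a, :a_1])` skips `:a_1` for the second column because that name already appears in the input, which is why the expected names in the accompanying test are `[:a, :a_2, :a_1]`.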

test/basics.jl

Lines changed: 10 additions & 0 deletions

@@ -748,4 +748,14 @@ f = CSV.File(IOBuffer("a,b\n1,2\n3,"))
 @test f.a == [1, 3]
 @test isequal(f.b, [2, missing])
 
+# duplicate column names
+f = CSV.File(IOBuffer("a,a,a\n"))
+@test f.names == [:a, :a_1, :a_2]
+
+f = CSV.File(IOBuffer("a,a_1,a\n"))
+@test f.names == [:a, :a_1, :a_2]
+
+f = CSV.File(IOBuffer("a,a,a_1\n")) # this case is not covered in test_duplicate_columnnames.csv
+@test f.names == [:a, :a_2, :a_1]
+
 end