Commit 482a187
Improve performance of reading files with duplicate column names (#955)
* Add tests for duplicate column name handling
In the next commit, we'll be changing the code responsible for naming
duplicate columns and these tests should ensure that the behavior
doesn't change.
* Improve performance of reading files with duplicate column names
I need to load a file with 30k columns, 10k of which have the same
name. Currently, this is practically impossible because makeunique(),
which produces unique column names, has cubic complexity.
This commit changes the algorithm to use a Dict to quickly look up the
existence of columns and to cache the next numeric suffix used to
uniquify column names.
Care has been taken to ensure that columns are named the same way as
before. To that end, additional tests were added in the previous
commit.

1 parent d25992a, commit 482a187
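
The Dict-based approach described in the commit message can be illustrated with a minimal sketch. This is written in Python as an illustration only and is not the code from this commit. The function name `make_unique`, the helper names, and the `name_1`, `name_2` suffix format are assumptions, since the actual diff content is not shown here.

```python
def make_unique(names):
    """Rename repeated names to name_1, name_2, ... (assumed suffix scheme)."""
    taken = set(names)     # all input names are reserved up front
    first_seen = set()     # names whose first occurrence was already emitted
    next_suffix = {}       # base name -> next numeric suffix to try (the cache)
    result = []
    for name in names:
        if name not in first_seen:
            # The first occurrence keeps its original name.
            first_seen.add(name)
            result.append(name)
            continue
        # Resume from the cached suffix instead of rescanning from 1.
        k = next_suffix.get(name, 1)
        while f"{name}_{k}" in taken:
            k += 1
        new_name = f"{name}_{k}"
        taken.add(new_name)
        next_suffix[name] = k + 1
        result.append(new_name)
    return result


if __name__ == "__main__":
    print(make_unique(["a", "b", "a", "a"]))
```

Each existence check is a hash lookup, and the cached suffix means a name repeated many times does not retry suffixes that are already used, so the total work stays close to linear in the number of columns under these assumptions.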
2 files changed: 16 additions, 3 deletions

File 1, diff lines 349 to 368: 6 additions, 3 deletions
File 2, diff lines 748 to 761: 10 additions