Labels: data, good first issue, performance
Description
We're currently wasting roughly 2.6 GiB on heavily duplicated homepages and descriptions (about 1.2 GiB and 1.4 GiB respectively, counting every copy of a value beyond the first). This is a textbook case for normalisation:
from django.db.models import Count, F, Sum
from django.db.models.functions import Length
duplicates = NixDerivationMeta.objects.values("homepage").annotate(count=Count("id"), length=Length("homepage")).filter(count__gt=1).order_by("-count")
wasted_bytes = duplicates.aggregate(total=Sum((F('count') - 1) * F('length')))['total']
wasted_bytes / 1024 / 1024
=> 1190.5588636398315
duplicates.filter(count__gt=1000).count()
=> 3966
duplicates.values_list("homepage","count")[:10]
=>
(None, 1800158),
('https://home-assistant.io/', 223107),
('https://www.qt.io', 63831),
('https://kde.org', 60266),
('http://www.kde.org', 50180),
('https://clang.llvm.org/', 47748),
('https://www.qt.io/', 44837),
('https://github.com/nltk/nltk_data', 33216),
(None, 31517), # ???
('https://www.nvidia.com/object/unix.html', 27553)
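The wasted-bytes figure above counts every copy of a value beyond the first, weighted by the value's length. Here is a minimal plain-Python sketch of the same arithmetic. The `homepages` list is made-up sample data and `wasted_chars` is a hypothetical helper. Like `Length`, it counts characters rather than encoded bytes, and it skips missing values on the assumption that SQL's `SUM` ignores the NULL term such a group produces.

```python
from collections import Counter

# Hypothetical sample of homepage values; None stands for a missing homepage.
homepages = [
    "https://kde.org",
    "https://kde.org",
    "https://kde.org",
    "https://www.qt.io",
    "https://www.qt.io",
    "https://example.org/unique",
    None,
    None,
]


def wasted_chars(values):
    """Sum (count - 1) * length over every value that occurs more than once."""
    # Missing values carry no text, so they contribute nothing to the total.
    counts = Counter(v for v in values if v is not None)
    return sum((n - 1) * len(v) for v, n in counts.items() if n > 1)


print(wasted_chars(homepages))
```

Only the extra copies are counted because one copy of each distinct value still has to be stored after normalisation.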
duplicates = NixDerivationMeta.objects.values("description").annotate(count=Count("id"), length=Length("description")).filter(count__gt=1).order_by("-count")
wasted_bytes = duplicates.aggregate(total=Sum((F('count') - 1) * F('length')))['total']
wasted_bytes / 1024 / 1024
=> 1450.313220024109
duplicates.filter(count__gt=1000).count()
=> 1748
duplicates.values_list("description","count")[:10]
=>
(None, 2462615),
('Open source home automation that puts local control and privacy first', 219548),
('Cross-platform application framework for C++', 105133),
('C language family frontend for LLVM (wrapper script)', 46109),
('NLTK Data', 33216),
('The default build environment for Unix packages in Nixpkgs', 30324),
('X.org driver and kernel module for NVIDIA cards', 26381),
('Android SDK tools, packaged in Nixpkgs', 23946),
('Glasgow Haskell Compiler', 23114),
('NVIDIA Linux Open GPU Kernel Module', 22590)
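One way to picture the normalisation: store each distinct string once in a lookup table and have the metadata rows reference it by id. The sketch below uses the standard-library `sqlite3` module rather than the project's Django models. The table names, column names, sample rows and the helpers `intern_homepage` and `add_meta` are all hypothetical, not taken from the codebase. In Django this would presumably become a `ForeignKey` on `NixDerivationMeta`, but that is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: one row per distinct homepage URL.
    CREATE TABLE homepage (
        id  INTEGER PRIMARY KEY,
        url TEXT NOT NULL UNIQUE
    );
    CREATE TABLE derivation_meta (
        id          INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        homepage_id INTEGER REFERENCES homepage(id)  -- NULL when no homepage
    );
""")


def intern_homepage(conn, url):
    """Return the id of the row holding `url`, inserting it on first sight."""
    if url is None:
        return None
    # The UNIQUE constraint makes repeated inserts of the same URL a no-op.
    conn.execute("INSERT OR IGNORE INTO homepage (url) VALUES (?)", (url,))
    row = conn.execute("SELECT id FROM homepage WHERE url = ?", (url,)).fetchone()
    return row[0]


def add_meta(conn, name, url):
    """Insert a metadata row that points at the shared homepage row."""
    conn.execute(
        "INSERT INTO derivation_meta (name, homepage_id) VALUES (?, ?)",
        (name, intern_homepage(conn, url)),
    )


# Made-up sample rows: two packages share a homepage, one has none.
for name, url in [
    ("qtbase", "https://www.qt.io"),
    ("qtsvg", "https://www.qt.io"),
    ("hello", None),
]:
    add_meta(conn, name, url)

stored_urls = conn.execute("SELECT COUNT(*) FROM homepage").fetchone()[0]
print(stored_urls)
```

With this layout each duplicate costs one integer reference instead of a full copy of the string. The same approach would apply to `description`.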