Deduplicate derivation metadata #790

@fricklerhandwerk

We're currently wasting ~3GB (about 1.2 GiB on homepages and 1.4 GiB on descriptions, per the queries below) due to extreme duplication of homepages and descriptions. This is a textbook case for normalisation:

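from django.db.models import Count, F, Sum
from django.db.models.functions import Length

# homepage: group identical values, count the copies, and record each value's length
duplicates = NixDerivationMeta.objects.values("homepage").annotate(count=Count("id"), length=Length("homepage")).filter(count__gt=1).order_by("-count")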
wasted_bytes = duplicates.aggregate(total=Sum((F('count') - 1) * F('length')))['total']
wasted_bytes / 1024 / 1024
=> 1190.5588636398315
duplicates.filter(count__gt=1000).count()
=> 3966
duplicates.values_list("homepage","count")[:10]
=> 
(None, 1800158),
('https://home-assistant.io/', 223107),
('https://www.qt.io', 63831),
('https://kde.org', 60266),
('http://www.kde.org', 50180),
('https://clang.llvm.org/', 47748),
('https://www.qt.io/', 44837),
('https://github.com/nltk/nltk_data', 33216),
(None, 31517), # ???
('https://www.nvidia.com/object/unix.html', 27553)
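
The estimate above charges every copy beyond the first with the full length of the value, i.e. it sums `(count - 1) * length` over all duplicated values. A minimal pure-Python sketch of the same arithmetic, assuming a small made-up sample rather than real tracker data (`wasted_characters` and `sample` are hypothetical names). Note that Django's `Length` counts characters, so the figure is only a byte count for ASCII-only values, and `None` groups contribute nothing because their length is NULL:

```python
from collections import Counter


def wasted_characters(values):
    """Sum (count - 1) * length over every value that occurs more than once."""
    # None is skipped: in SQL its length is NULL, which SUM ignores.
    counts = Counter(v for v in values if v is not None)
    return sum((n - 1) * len(v) for v, n in counts.items() if n > 1)


# Hypothetical sample, not real data.
sample = (
    ["https://kde.org"] * 3
    + ["https://www.qt.io"] * 2
    + ["https://example.org", None, None]
)

# Two redundant copies of a 15-character URL plus one of a 17-character URL.
print(wasted_characters(sample))  # 47
```
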
duplicates = NixDerivationMeta.objects.values("description").annotate(count=Count("id"), length=Length("description")).filter(count__gt=1).order_by("-count")
wasted_bytes = duplicates.aggregate(total=Sum((F('count') - 1) * F('length')))['total']
wasted_bytes / 1024 / 1024
=> 1450.313220024109
duplicates.filter(count__gt=1000).count()
=> 1748
duplicates.values_list("description","count")[:10]
=>
(None, 2462615),
('Open source home automation that puts local control and privacy first', 219548),
('Cross-platform application framework for C++', 105133),
('C language family frontend for LLVM (wrapper script)', 46109),
('NLTK Data', 33216),
('The default build environment for Unix packages in Nixpkgs', 30324),
('X.org driver and kernel module for NVIDIA cards', 26381),
('Android SDK tools, packaged in Nixpkgs', 23946),
('Glasgow Haskell Compiler', 23114),
('NVIDIA Linux Open GPU Kernel Module', 22590)
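
One way the normalisation could look, sketched with the standard-library `sqlite3` module so it runs without the project: each distinct homepage is stored once in a lookup table and the metadata rows only hold a foreign key. The table names, column names, and `add_meta` helper are assumptions for illustration, not the project's schema. In Django the equivalent would presumably be a `ForeignKey` to a small model populated via `get_or_create`, and descriptions could be handled the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE homepage (
        id  INTEGER PRIMARY KEY,
        url TEXT NOT NULL UNIQUE          -- each distinct URL is stored once
    );
    CREATE TABLE derivation_meta (
        id          INTEGER PRIMARY KEY,
        homepage_id INTEGER REFERENCES homepage(id)  -- NULL when no homepage is set
    );
""")


def add_meta(conn, url):
    """Insert one metadata row, reusing the homepage row if the URL already exists."""
    homepage_id = None
    if url is not None:
        conn.execute("INSERT OR IGNORE INTO homepage (url) VALUES (?)", (url,))
        homepage_id = conn.execute(
            "SELECT id FROM homepage WHERE url = ?", (url,)
        ).fetchone()[0]
    cur = conn.execute(
        "INSERT INTO derivation_meta (homepage_id) VALUES (?)", (homepage_id,)
    )
    return cur.lastrowid


# Hypothetical input: two derivations share a homepage, one has none.
for url in ["https://kde.org", "https://kde.org", "https://www.qt.io", None]:
    add_meta(conn, url)

meta_rows = conn.execute("SELECT COUNT(*) FROM derivation_meta").fetchone()[0]
homepage_rows = conn.execute("SELECT COUNT(*) FROM homepage").fetchone()[0]
print(meta_rows, homepage_rows)  # 4 2
```

With this layout a URL shared by hundreds of thousands of rows costs one text value plus one integer per row, instead of one full copy of the string per row.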

Labels: data (something about quality or quantity of ingested data), good first issue (good for newcomers), performance
