Deduplicate Scratch.context #1739

@mkst

I asked our pal, ChatGPT, for a script to determine whether it makes sense to de-duplicate contexts. TL;DR: yes, it'll save ~15GB of space (uncompressed).

Script

import hashlib
from django.db import transaction
from coreapp.models.scratch import Scratch  # replace with your actual app/model name

# --- Config ---
context_field = 'context'   # the TextField name that holds your shared text
chunk_size = 1000             # how many rows to process per DB fetch
# ---------------

stats = {}
total_bytes = 0
row_count = 0

print("Scanning contexts...")

# use iterator() to stream rows without loading everything into memory
with transaction.atomic():
    for s in Scratch.objects.only(context_field).iterator(chunk_size=chunk_size):
        text = getattr(s, context_field, None)
        if not text:
            continue
        b = text.encode('utf-8')
        size = len(b)
        h = hashlib.sha256(b).hexdigest()
        total_bytes += size
        row_count += 1
        entry = stats.get(h)
        if entry:
            entry['count'] += 1
        else:
            stats[h] = {'size': size, 'count': 1}

# compute totals
unique_bytes = sum(v['size'] for v in stats.values())
dedup_savings = 1 - (unique_bytes / total_bytes) if total_bytes else 0

print("\n--- Deduplication Estimate ---")
print(f"Rows scanned:       {row_count:,}")
print(f"Unique contexts:    {len(stats):,}")
print(f"Total raw size:     {total_bytes / 1_000_000:.2f} MB")
print(f"Unique text size:   {unique_bytes / 1_000_000:.2f} MB")
print(f"Potential savings:  {dedup_savings * 100:.2f}%")

Results (on my laptop, which has a slightly older version of the db: 158k scratches, not the current 196k):

Rows scanned:       158,225
Unique contexts:    68,473
Total raw size:     33902.49 MB
Unique text size:   16119.61 MB
Potential savings:  52.45%
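
As a quick sanity check, the percentage and the absolute saving can be re-derived from the two size totals above. This is only arithmetic on the reported numbers for the 158k-scratch snapshot, not a new measurement, and the variable names are just for illustration.

```python
# Figures copied from the results above (decimal MB, uncompressed).
total_mb = 33902.49    # "Total raw size"
unique_mb = 16119.61   # "Unique text size"

# Bytes that would no longer be stored if every duplicate pointed at one copy.
saved_mb = total_mb - unique_mb

# Same formula the script uses for "Potential savings".
savings_pct = (1 - unique_mb / total_mb) * 100

print(f"Saved:   {saved_mb:.2f} MB")
print(f"Savings: {savings_pct:.2f}%")
```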

In terms of implementation, we should have a separate table for Contexts. Each scratch will point to an entry in this table. If a user modifies the context, we will check whether that modified version already exists and reuse it; otherwise we'll create a new Context.
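
A rough sketch of that lookup-or-create flow, using the standard-library `sqlite3` module rather than our actual Django models so it runs on its own. The table layout, the `sha256` column, and the helper names `get_or_create_context` and `set_scratch_context` are all assumptions for illustration, not a proposed schema.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE context (
    id     INTEGER PRIMARY KEY,
    sha256 TEXT NOT NULL UNIQUE,   -- hypothetical lookup key for the text
    text   TEXT NOT NULL
);
CREATE TABLE scratch (
    id         INTEGER PRIMARY KEY,
    context_id INTEGER REFERENCES context(id)
);
""")


def get_or_create_context(conn, text):
    """Return the id of the context row holding `text`, inserting it if new."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT id FROM context WHERE sha256 = ?", (digest,)
    ).fetchone()
    if row:
        return row[0]
    cur = conn.execute(
        "INSERT INTO context (sha256, text) VALUES (?, ?)", (digest, text)
    )
    return cur.lastrowid


def set_scratch_context(conn, scratch_id, text):
    """Point a scratch at the (possibly shared) context row for `text`."""
    context_id = get_or_create_context(conn, text)
    conn.execute(
        "INSERT OR REPLACE INTO scratch (id, context_id) VALUES (?, ?)",
        (scratch_id, context_id),
    )
    conn.commit()
    return context_id


# Two scratches share one context; editing scratch 2 creates a second row.
first = set_scratch_context(conn, 1, "typedef int s32;")
second = set_scratch_context(conn, 2, "typedef int s32;")
edited = set_scratch_context(conn, 2, "typedef int s32;\ntypedef short s16;")
context_rows = conn.execute("SELECT COUNT(*) FROM context").fetchone()[0]
```

Note that editing never mutates a shared row in place; the scratch is simply re-pointed, which is what leaves orphaned contexts behind for housekeeping. Under concurrent writes, the unique constraint on the hash would be the real guard against duplicate rows.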

This means we should update the housekeeping script to clean up unreferenced contexts (the same way it clears out ownerless scratches and duff profiles).
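
The cleanup itself could be a single delete of contexts that no scratch points to. Below is a minimal standalone sketch using `sqlite3`. The tables and the `delete_unreferenced_contexts` helper are hypothetical, and the real housekeeping script would presumably go through the Django ORM instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE context (id INTEGER PRIMARY KEY, text TEXT NOT NULL);
CREATE TABLE scratch (
    id         INTEGER PRIMARY KEY,
    context_id INTEGER REFERENCES context(id)
);
-- Context 1 is shared by two scratches; context 2 has no scratch left.
INSERT INTO context (id, text) VALUES (1, 'still used'), (2, 'orphaned');
INSERT INTO scratch (id, context_id) VALUES (10, 1), (11, 1);
""")


def delete_unreferenced_contexts(conn):
    """Delete context rows that no scratch references; return how many went."""
    cur = conn.execute("""
        DELETE FROM context
        WHERE NOT EXISTS (
            SELECT 1 FROM scratch WHERE scratch.context_id = context.id
        )
    """)
    conn.commit()
    return cur.rowcount


removed = delete_unreferenced_contexts(conn)
remaining = [row[0] for row in conn.execute("SELECT id FROM context ORDER BY id")]
print(f"Removed {removed} unreferenced context(s); remaining ids: {remaining}")
```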
