This page is an advanced user guide for jobcurator.
- For installation, basic usage, and backend comparison → see README.md.
- Here we focus on:
  - how the objects (`Job`, `Category`, etc.) fit together, and
  - how to build an incremental deduplication pipeline over time (batches, SQL or local files).
A Job is the main unit you pass to JobCurator.
Typical fields:

- **Required**
  - `id: str` – unique identifier for the job
- **Content / context** (optional but recommended)
  - `title: str | None`
  - `text: str | None`
  - `categories: dict[str, list[Category]]`
  - `location: Location3DField | None`
  - `salary: SalaryField | None`
  - `company: str | None`
  - `contract_type: str | None`
  - `source: str | None`
  - `created_at: datetime | None`
- **Computed by `JobCurator`**
  - `length_score: float`
  - `completion_score_val: float`
  - `quality: float` ← used for ranking / selection
  - `exact_hash: int`
  - `signature: int` ← 128-bit composite hash, used for diversity (Hamming)
You create Job objects, then JobCurator enriches them with quality + hash metadata.
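For orientation, here is a minimal sketch of that flow. It assumes `Job` is importable from the top-level package and accepts the fields above as keyword arguments; check the actual constructor in your installed version.

```python
# Minimal sketch, assuming Job is importable from the top-level package
# and accepts the documented fields as keyword arguments.
from jobcurator import Job, JobCurator

jobs = [
    Job(id="job-001", title="Backend Engineer", text="We build APIs in Python..."),
    Job(id="job-002", title="Data Scientist", text="Join our ML platform team..."),
]

curator = JobCurator(backend="default_hash", ratio=0.5, alpha=0.6)
compressed = curator.dedupe_and_compress(jobs)

# After processing, each surviving job carries the computed metadata:
for job in compressed:
    print(job.id, job.quality, job.signature)
```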
A Category represents hierarchical category information (a multi-level taxonomy).
Fields:

- `id: str`
- `label: str`
- `level: int` – depth (0 = root)
- `parent_id: str | None`
- `level_path: list[str]` – e.g. `["Engineering", "Software", "Backend"]`
A job can have multiple category dimensions at once:
```python
job.categories = {
    "job_function": [
        Category(
            id="backend",
            label="Backend",
            level=2,
            parent_id="software",
            level_path=["Engineering", "Software", "Backend"],
        )
    ],
    "industry": [
        Category(
            id="saas",
            label="SaaS",
            level=1,
            parent_id="tech",
            level_path=["Technology", "SaaS"],
        )
    ],
}
```

These categories are used in the hashing process (meta-hash, MinHash, FAISS vectors).
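To make that concrete, here is a hypothetical illustration of turning category paths into hashable tokens. jobcurator's actual meta-hash construction is not specified in this guide, so the token format and hashing scheme below are assumptions.

```python
# Hypothetical sketch: flatten category paths into stable tokens and hash
# them. The real meta-hash inside jobcurator may differ.
import hashlib

def category_tokens(categories):
    for dimension, cats in categories.items():
        for cat in cats:
            yield f"{dimension}:{'/'.join(cat.level_path)}"

tokens = sorted(category_tokens(job.categories))
meta_hash = int.from_bytes(
    hashlib.md5("|".join(tokens).encode()).digest()[:8], "big"
)
```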
A Location3DField is a location with 3D coordinates for proper geographic distance computations:
- **Inputs**
  - `lat: float` – latitude (degrees)
  - `lon: float` – longitude (degrees)
  - `alt_m: float` – altitude in meters (optional)
  - `city: str | None`
  - `country_code: str | None`
- **Internal**
  - `x, y, z: float` – Earth-centered 3D coordinates (computed once)
JobCurator uses these coordinates to:
- avoid merging jobs that are geographically too far apart, and
- add location to feature vectors (e.g. in the FAISS backend).
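For intuition, here is a rough sketch of the Earth-centered conversion and the distance check it enables. This uses a spherical-Earth approximation; the actual math inside `Location3DField` may differ.

```python
# Spherical-Earth sketch of lat/lon -> Earth-centered (x, y, z) in meters.
# The real Location3DField implementation may use a different model.
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius

def to_ecef(lat: float, lon: float, alt_m: float = 0.0) -> tuple[float, float, float]:
    lat_r, lon_r = math.radians(lat), math.radians(lon)
    r = EARTH_RADIUS_M + alt_m
    return (
        r * math.cos(lat_r) * math.cos(lon_r),
        r * math.cos(lat_r) * math.sin(lon_r),
        r * math.sin(lat_r),
    )

# Straight-line (chord) distance is a cheap proxy for geographic distance
# at the ~50 km scales used for clustering:
paris, lyon = to_ecef(48.8566, 2.3522), to_ecef(45.7640, 4.8357)
dist_km = math.dist(paris, lyon) / 1000.0
```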
A SalaryField holds structured salary information:

- `min_value: float | None`
- `max_value: float | None`
- `currency: str` – e.g. `"EUR"`, `"USD"`
- `period: str` – e.g. `"year"`, `"month"`
Salary can be bucketized and used in the hashing / meta-hash steps.
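As an illustration, a bucketization could look like the following; the actual bucket boundaries and period normalization used by jobcurator are not documented here, so everything below is an assumption.

```python
# Hypothetical salary bucketization; jobcurator's real bucket boundaries
# and normalization are not documented in this guide.
def salary_bucket(min_value: float | None, period: str = "year") -> str:
    if min_value is None:
        return "unknown"
    yearly = min_value * 12 if period == "month" else min_value
    if yearly < 30_000:
        return "low"
    if yearly < 60_000:
        return "mid"
    return "high"
```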
JobCurator works entirely in memory on a list of Job objects:
- **Scores quality** – combines length, completion, freshness, etc. into a single `quality` score per job.
- **Computes hashes & signatures**
  - exact hash (to remove strict duplicates; see the sketch below)
  - SimHash / MinHash / FAISS vector signatures, depending on backend
- **Clusters similar jobs** – using LSH, MinHash, FAISS, etc.
- **Selects a subset (compression)** – respects:
  - `ratio` (e.g. keep 50%),
  - `alpha` (quality vs. diversity trade-off),
  - cluster-level pooling (`max_per_cluster_in_pool`).
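The exact-hash step referenced above amounts to keeping one representative per identical hash. A minimal sketch, assuming `exact_hash` has already been computed on each job:

```python
# Minimal sketch of strict-duplicate removal via exact_hash; assumes
# JobCurator has already populated job.exact_hash.
def drop_exact_duplicates(jobs):
    seen, unique = set(), []
    for job in jobs:
        if job.exact_hash not in seen:
            seen.add(job.exact_hash)
            unique.append(job)
    return unique
```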
Canonical call:
```python
compressed_jobs = curator.dedupe_and_compress(jobs)
```

In many real-world setups:

- jobs arrive in batches (e.g. `jobs1`, `jobs2`, …),
- you want to dedupe against past batches,
- and you want to maintain a global compressed view over time.
The incremental strategy uses three additional pieces:
- **CuckooFilter**
  - A compact “seen set” of exact hashes.
  - Lets you check “have we already seen something that looks exactly like this job?”
  - Updated each time you process a batch.
- **StoreDB interface**
  - Abstracts where compressed jobs and the CuckooFilter are stored.
  - There are ready-made implementations for:
    - SQL (`SqlStoreDB`)
    - local files (`LocalFileStoreDB`)
- **Helpers for incremental flows**
  - `process_batch(store, jobs, curator)`
  - `global_reselect_in_store(store, ratio, alpha)`
These live under:
```python
from jobcurator.storage import (
    StoreDB,
    SqlStoreDB,
    LocalFileStoreDB,
    process_batch,
    global_reselect_in_store,
)
```

Conceptually, StoreDB is:
“Anything that can store compressed jobs + one global CuckooFilter, and can list minimal per-job metadata when we want to rebalance.”
It needs to support:
- **CuckooFilter state**
  - `load_or_create_cuckoo(capacity) -> CuckooFilter`
  - `save_cuckoo(cf) -> None`
- **Compressed jobs**
  - `insert_compressed_jobs(compressed_jobs, backend)`
  - `load_all_light_jobs() -> list[LightJob]`
  - `overwrite_with_selected(selected_ids)`

The algorithmic core only needs, for each job:

- `id`
- `quality`
- `signature`
Everything else (title, text, company, location, etc.) is for your own business needs.
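If the built-in stores don't fit, you can implement the interface yourself. Below is a hypothetical in-memory sketch, handy for tests. It assumes `StoreDB` can be subclassed with exactly the methods listed above, and that `CuckooFilter` is importable from `jobcurator` with a `capacity` constructor argument — verify both against your installed version.

```python
# Hypothetical in-memory StoreDB sketch for tests. The StoreDB base class
# and the CuckooFilter import path / constructor are assumptions.
from jobcurator.storage import StoreDB

class InMemoryStoreDB(StoreDB):
    def __init__(self):
        self._cuckoo = None
        self._jobs = {}  # id -> compressed job

    def load_or_create_cuckoo(self, capacity):
        if self._cuckoo is None:
            from jobcurator import CuckooFilter  # import path is an assumption
            self._cuckoo = CuckooFilter(capacity)
        return self._cuckoo

    def save_cuckoo(self, cf):
        self._cuckoo = cf

    def insert_compressed_jobs(self, compressed_jobs, backend):
        for job in compressed_jobs:
            self._jobs[job.id] = job

    def load_all_light_jobs(self):
        # The core only reads id / quality / signature, so returning the
        # full jobs (which carry those attributes) is sufficient here.
        return list(self._jobs.values())

    def overwrite_with_selected(self, selected_ids):
        keep = set(selected_ids)
        self._jobs = {jid: job for jid, job in self._jobs.items() if jid in keep}
```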
`process_batch` is used for each new batch of raw jobs:
```python
from jobcurator.storage import process_batch

compressed_jobsN = process_batch(
    store=my_store_db,  # SqlStoreDB or LocalFileStoreDB
    jobs=jobsN,
    curator=my_curator,
)
```

What happens:
- Load or create a global CuckooFilter from the store.
- Run `curator.dedupe_and_compress(jobsN, seen_filter=cuckoo_filter)`:
  - dedup + compress inside the batch,
  - drop jobs that seem already seen (exact hash), based on previous batches.
- Insert the resulting `compressed_jobsN` into storage.
- Save the updated CuckooFilter back to the store.
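Put together, those steps amount to roughly the following. This is a sketch of the flow, not the actual implementation in `jobcurator.storage`; the `capacity` value is an arbitrary assumption.

```python
# Rough sketch of the process_batch flow described above; the real
# implementation lives in jobcurator.storage and may differ in detail.
def process_batch_sketch(store, jobs, curator):
    cuckoo = store.load_or_create_cuckoo(capacity=1_000_000)  # capacity: assumption
    compressed = curator.dedupe_and_compress(jobs, seen_filter=cuckoo)
    store.insert_compressed_jobs(compressed, backend="default_hash")
    store.save_cuckoo(cuckoo)
    return compressed
```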
Result:
- You can process `jobs1`, `jobs2`, `jobs3`, … in order,
- without ever having to reload all previous compressed jobs into memory,
- while still avoiding re-inserted duplicates across batches.
Over time, you may want to:
- keep only a certain fraction of all compressed jobs (e.g. 50%),
- while preserving diversity and quality across all batches.
You can use:
```python
from jobcurator.storage import global_reselect_in_store

global_reselect_in_store(
    store=my_store_db,
    ratio=0.5,  # keep ~50% of stored compressed jobs
    alpha=0.6,  # trade-off between quality and diversity
)
```

What happens:
- `store.load_all_light_jobs()` returns a list of light objects (id, quality, signature).
- A global greedy selection is run:
  - same quality + diversity logic as in `JobCurator`.
- `store.overwrite_with_selected(selected_ids)` keeps only those jobs in storage.
This gives you a globally consistent compressed set over multiple batches:
- the same similarity notion (Hamming distance on `signature`),
- the same `alpha` trade-off,
- but applied to everything you have stored, not just one batch.
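For intuition about the quality/diversity trade-off, here is a hedged sketch of a greedy selection over light jobs. The helper names are illustrative; it assumes `quality` is normalized to [0, 1] and signatures are 128-bit integers, per the field descriptions above. jobcurator's actual scoring may differ.

```python
# Hedged sketch of greedy quality/diversity selection; assumes quality is
# in [0, 1] and signature is a 128-bit int. Not jobcurator's exact logic.
def hamming(a: int, b: int) -> int:
    return (a ^ b).bit_count()  # Python 3.10+

def greedy_select(light_jobs, ratio: float, alpha: float) -> list[str]:
    if not light_jobs:
        return []
    k = max(1, int(len(light_jobs) * ratio))
    remaining = sorted(light_jobs, key=lambda j: j.quality, reverse=True)
    selected = [remaining.pop(0)]  # seed with the highest-quality job
    while remaining and len(selected) < k:
        def score(j):
            # diversity = normalized Hamming distance to the nearest
            # already-selected job (128 = signature width in bits)
            d = min(hamming(j.signature, s.signature) for s in selected) / 128
            return alpha * j.quality + (1 - alpha) * d
        best = max(remaining, key=score)
        remaining.remove(best)
        selected.append(best)
    return [j.id for j in selected]
```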
A full end-to-end example with SQL storage (Postgres):

```python
from jobcurator import JobCurator
from jobcurator.storage import SqlStoreDB, process_batch, global_reselect_in_store
import psycopg2

# 1) Connect to your database
conn = psycopg2.connect("dbname=... user=... password=... host=...")

# 2) Choose a storage implementation
store = SqlStoreDB(conn)

# 3) Configure JobCurator
curator = JobCurator(
    backend="default_hash",
    ratio=0.5,
    alpha=0.6,
    max_per_cluster_in_pool=3,
    d_sim_threshold=20,
    max_cluster_distance_km=50.0,
    use_multiprobe=True,
)

# 4) Process batches incrementally
compressed_jobs1 = process_batch(store, jobs1, curator)
compressed_jobs2 = process_batch(store, jobs2, curator)
compressed_jobs3 = process_batch(store, jobs3, curator)
# ...

# 5) Periodically rebalance globally
global_reselect_in_store(store, ratio=0.5, alpha=0.6)
```

Test with SQL storage (Postgres):
```bash
python3 test_incremental.py \
  --backend default_hash \
  --ratio 0.5 \
  --alpha 0.6 \
  --storage sql \
  --dsn "dbname=mydb user=myuser password=mypass host=localhost port=5432" \
  --batches 3 \
  --n-per-batch 30
  # --no-global-reselect  # optional
```

The same flow works with the local file-based store:

```python
from jobcurator import JobCurator
from jobcurator.storage import LocalFileStoreDB, process_batch, global_reselect_in_store
# 1) Use the local file-based store
store = LocalFileStoreDB() # defaults to ./data/compressed_jobs.jsonl, ./data/cuckoo_filter.pkl
# 2) Configure JobCurator as usual
curator = JobCurator(
backend="default_hash",
ratio=0.5,
alpha=0.6,
max_per_cluster_in_pool=3,
d_sim_threshold=20,
max_cluster_distance_km=50.0,
use_multiprobe=True,
)
# 3) Process incoming batches
compressed_jobs1 = process_batch(store, jobs1, curator)
compressed_jobs2 = process_batch(store, jobs2, curator)
# 4) Periodic global cleanup / rebalancing
global_reselect_in_store(store, ratio=0.5, alpha=0.6)
```

Test with local storage:
```bash
python3 test_incremental.py \
  --backend default_hash \
  --ratio 0.5 \
  --alpha 0.6 \
  --storage local \
  --dsn "" \
  --batches 3 \
  --n-per-batch 20 \
  --clear-local
  # --no-global-reselect  # (optional) add this flag if you want to skip final global rebalancing
```

You should consider the incremental pipeline if:
- Jobs arrive continuously (daily/hourly feeds).
- You want to avoid reprocessing or reloading all historical jobs.
- You need a bounded global set of compressed jobs with:
  - controlled compression ratio,
  - stable diversity,
  - and consistent quality scoring.
If you just want to dedupe one big static snapshot once, you can call:
```python
compressed_jobs = curator.dedupe_and_compress(jobs)
```

directly and ignore the incremental API.
For long-running production feeds, the combination of:
- `JobCurator` (in-memory dedup/compression),
- `CuckooFilter` (seen set),
- `StoreDB` (persistence),
- `process_batch` + `global_reselect_in_store`
gives you a clean, reusable pattern to scale over time.