Utilities for syncing, extracting, and managing a Tatar/Crimean Tatar document corpus. The project connects Yandex.Disk storage, a database, and Yandex Cloud (S3-compatible) buckets to:
- synchronize document metadata with a database
- extract content and metadata from documents
- upload processed artifacts to cloud storage
- perform maintenance tasks (public links, deduplication, layout analysis)
High-level flow:

- **Source storage (Yandex.Disk)**: raw documents live in Yandex.Disk under configured entry points.
- **Database (metadata + state)**: document records track MD5, storage locations, language, metadata, and processing state.
- **Processing pipelines**:
  - Content extraction (`src/content/*`) for PDFs and non-PDFs
  - Metadata extraction (`src/metadata/*`) using Gemini prompts
  - Optional layout detection (`src/experimental/layout/dispatch.py`)
- **Artifact storage (Yandex Cloud S3)**: extracted content, images, and metadata are uploaded to buckets.
- **Maintenance**: sync, dedup, and public link verification keep the dataset clean and accessible.

In short: Yandex.Disk → DB → extraction → S3, with maintenance tools keeping everything aligned.
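As a toy illustration of the loop above (all names here are illustrative; the real pipelines live under `src/`), the flow amounts to: hash each source document, skip MD5s already in the database, extract, and record state for upload:

```python
import hashlib

def run_flow(raw_documents):
    """Toy end-to-end pass: dedupe by MD5, 'extract', and record state."""
    db = {}  # stand-in for the real database, keyed by MD5
    for raw in raw_documents:                    # 1. pull from source storage
        md5 = hashlib.md5(raw).hexdigest()       # 2. key the record by MD5
        if md5 in db:
            continue                             # already synced: skip
        text = raw.decode("utf-8", errors="ignore")  # 3. extraction stub
        db[md5] = {"state": "extracted", "content": text}  # 4. ready for upload
    return db

state = run_flow([b"doc one", b"doc one", b"doc two"])
print(len(state))  # -> 2 (the duplicate is skipped)
```

This is only the shape of the pipeline; the real stages run asynchronously against Yandex.Disk, the database, and S3.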
- `src/core/`: runtime primitives (config, db sessions, paths, encryption, worker state)
- `src/integrations/`: adapters for external systems (Gemini, S3, Yandex Disk)
- `src/content/`: content extraction and postprocessing pipelines
- `src/metadata/`: metadata extraction and applicability evaluation
- `src/dataset/`: dataset assembly pipelines
- `src/experimental/layout/`: layout-specific experimental processing
- `src/sync/`: synchronization workflows and helpers
- `src/maintenance/`: operational and maintenance workflows
- `src/prompts/`: content/metadata prompt templates and helpers
- `src/cli/`: command registration and CLI argument mapping
- Domain modules should import shared runtime behavior from `core/*`.
- External APIs should be consumed via `integrations/*`.
- Deprecated modules:
  - `meta_fields` -> use `metadata.fields`
  - `meta` package -> use `metadata`
`make lint` runs:

- Ruff lint checks
- `scripts/check_architecture.py` boundary checks
- Create a virtual environment and install dependencies:

  ```bash
  python -m venv .venv
  .venv/bin/pip install -r requirements.txt
  ```

- Create a local config at `~/.monocorpus/config.yaml` (see template below).

- Run CLI commands via the entrypoint:

  ```bash
  python src/main.py --help
  ```

The project expects a local config file at `~/.monocorpus/config.yaml` and a few optional credential files (see below). Keep secrets out of the repo.
Minimal template (fill with your own values):

```yaml
database_url: "postgresql+psycopg2://USER:PASSWORD@HOST:PORT/DBNAME"
encryption_key: "BASE64_URLSAFE_KEY"
proxy: null

yandex:
  disk:
    oauth_token: "YANDEX_DISK_OAUTH_TOKEN"
    hidden: "/path/segment/used/for/sharing_restricted"
    entry_points:
      tt: "/path/to/tatar/entry_point"
      crh: "/path/to/crimean_tatar/entry_point"
      filtered_out: "/path/to/filtered_out"
  cloud:
    aws_access_key_id: "YANDEX_CLOUD_ACCESS_KEY"
    aws_secret_access_key: "YANDEX_CLOUD_SECRET"
    bucket:
      document: "ttdoc"
      content: "ttcontent"
      content_chunks: "ttcontent_chunks"
      image: "ttimg"
      metadata: "ttmeta"
      upstream_metadata: "ttupstream"

gemini_api_keys:
  - "GEMINI_API_KEY_1"
  - "GEMINI_API_KEY_2"

google_api_key:
  free: "GEMINI_API_KEY_FOR_CLI"
```

Optional local files (kept out of git):

- `_artifacts/credentials/client_secret.json` and `_artifacts/credentials/personal_token.json` for Google APIs
- any extra tokens required by your workflow
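The `encryption_key` value is a URL-safe base64 string. Assuming a Fernet-style 32-byte key (an assumption; check the encryption helpers in `src/core` for the actual scheme), a valid value can be generated with the standard library:

```python
import base64
import os

# Generate 32 random bytes and encode them as URL-safe base64, the shape a
# Fernet-style key expects (assumption: verify against src/core).
key = base64.urlsafe_b64encode(os.urandom(32)).decode("ascii")
print(key)  # paste this into encryption_key
```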
Run all commands via `python src/main.py <command>`:

- `sync`: sync Yandex.Disk and database, handle filtering and deduplication
- `extract`: extract content from documents (PDF and non-PDF)
- `meta`: extract metadata from documents
- `hf`: assemble structured dataset into parquet
- `layouts`: run layout detection on PDFs
- `pps`: postpostprocess extracted markdown in `~/.monocorpus/1_result` and re-upload updated archives
- `dedup`: scan for near-full duplicate extracted documents and produce a JSON report
- `match-limited`: reconcile limited vs full document variants
- `sharing-restricted`: check sharing-restricted documents
- `check-pub-links`: verify/restore public links
- `dump-state`: export database state to CSV and Google Drive/Sheets
- `upload-to-s3`: upload missing Crimean Tatar documents
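A command registry of this shape can be sketched with `argparse` subparsers (the helper and handler names below are illustrative, not the project's actual `src/cli` API):

```python
import argparse

def build_parser(commands):
    """Map subcommand names to handler callables (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="python src/main.py")
    sub = parser.add_subparsers(dest="command", required=True)
    for name, handler in commands.items():
        # Each subcommand gets its own parser; real commands would also
        # register per-command flags like --md5 or --workers here.
        sub.add_parser(name).set_defaults(func=handler)
    return parser

parser = build_parser({"sync": lambda args: "synced"})
args = parser.parse_args(["sync"])
print(args.func(args))  # -> synced
```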
Use `--help` for command options:

```bash
python src/main.py extract --help
```

Below are the most commonly used commands and typical flows.
### sync

Synchronizes Yandex.Disk with the database, applies filtering rules, and updates links.

```bash
python src/main.py sync
```
### extract

Extracts content from documents. Use `--md5` or `--path` to scope work.
`--workers` controls Gemini parallelism; `--batch-size` controls the queue size.

```bash
python src/main.py extract --workers 4
python src/main.py extract --md5 <MD5>
python src/main.py extract --path "/path/in/yadisk"
```
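To find the value for `--md5`, hash the document file. Assuming documents are keyed by the plain MD5 of their raw bytes (check the sync code to confirm), a small helper might look like:

```python
import hashlib

def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """MD5 of a file's raw bytes, read in chunks to handle large documents."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```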
### meta

Extracts structured metadata from documents.

```bash
python src/main.py meta
```
### hf

Builds a parquet dataset from extracted content.

```bash
python src/main.py hf
```
### layouts

Runs PDF layout detection (YOLO + Surya) and produces annotated outputs.

```bash
python src/main.py layouts --md5 <MD5>
```
### check-pub-links

Verifies public links and restores missing ones.

```bash
python src/main.py check-pub-links
```
### pps

Runs postpostprocessing on extracted markdown archives in `~/.monocorpus/1_result` and re-uploads updated archives to S3.

```bash
python src/main.py pps
```
### dedup

Scans extracted archives for near-full duplicate documents and writes a report with recommended keeper documents using format priority (epub > fb2 > docx > pdf).

```bash
python src/main.py dedup --threshold 0.98
```
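The keeper selection can be sketched as follows; the similarity measure and all names here are illustrative assumptions, not the command's actual implementation:

```python
from difflib import SequenceMatcher

# Lower rank = preferred format, mirroring epub > fb2 > docx > pdf.
FORMAT_RANK = {"epub": 0, "fb2": 1, "docx": 2, "pdf": 3}

def near_duplicates(a: str, b: str, threshold: float = 0.98) -> bool:
    """Treat two extracted texts as near-full duplicates above the threshold."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

def pick_keeper(cluster: list[dict]) -> dict:
    """Pick the document whose format ranks highest in the priority order."""
    return min(cluster, key=lambda doc: FORMAT_RANK.get(doc["format"], len(FORMAT_RANK)))

cluster = [{"md5": "aaa", "format": "pdf"}, {"md5": "bbb", "format": "epub"}]
print(pick_keeper(cluster)["md5"])  # -> bbb
```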
### dump-state

Exports DB state into CSV, ZIP, and Google Sheets/Drive.

```bash
python src/main.py dump-state
```

The local workdir is `~/.monocorpus` and is organized into subfolders like:
- `0_entry_point`: local copies of documents
- `1_result`: extracted content
- `2_metadata`: extracted metadata
- `misc/`: supporting artifacts (slices, upstream metadata, logs, etc.)

See `src/dirs.py` for the full list of subdirectories.
```
~/.monocorpus/
  0_entry_point/
    <md5>.pdf
    <md5>.docx
  1_result/
    <md5>-formatted.md
    <md5>.zip
  2_metadata/
    <md5>.json
  misc/
    doc_slices/
    upstream_metadata/
    page_images/
    clips/
    prompts/
    logs/
    parquet/
```
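Resolving files under this layout can be sketched with `pathlib` (the helper names are hypothetical; the real directory definitions live in `src/dirs.py`):

```python
from pathlib import Path

WORKDIR = Path.home() / ".monocorpus"

def result_markdown(md5: str) -> Path:
    """Formatted markdown produced by extraction for a given document MD5."""
    return WORKDIR / "1_result" / f"{md5}-formatted.md"

def metadata_json(md5: str) -> Path:
    """Extracted metadata for a given document MD5."""
    return WORKDIR / "2_metadata" / f"{md5}.json"

print(result_markdown("d41d8cd98f00b204e9800998ecf8427e").name)
```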
- Keep secrets in `~/.monocorpus/config.yaml` or other local files, not in the repo.
- This repository's `.gitignore` already ignores common secret files, but ensure sensitive files are not committed.
- Main CLI entrypoint: `src/main.py`
- Core utilities: `src/utils.py`
- Content pipeline: `src/content/*`
- Metadata pipeline: `src/metadata/*`
- Sync/maintenance: `src/sync.py`, `src/check_pub_links.py`, `src/match_limited.py`
Run unit tests:

```bash
.venv/bin/python -m unittest discover -s tests -v
```

Run lint checks:

```bash
make lint
make lint-fix
```