Monocorpus

Utilities for syncing, extracting, and managing a Tatar/Crimean Tatar document corpus. The project ties together Yandex.Disk storage, a database, and Yandex Cloud (S3-compatible) buckets to:

  • synchronize document metadata with a database
  • extract content and metadata from documents
  • upload processed artifacts to cloud storage
  • perform maintenance tasks (public links, deduplication, layout analysis)

Architecture Overview

High-level flow:

  1. Source storage (Yandex.Disk)
    Raw documents live in Yandex.Disk under configured entry points.

  2. Database (metadata + state)
    Document records track MD5, storage locations, language, metadata, and processing state.

  3. Processing pipelines

    • Content extraction (src/content/*) for PDFs and non-PDFs
    • Metadata extraction (src/metadata/*) using Gemini prompts
    • Optional layout detection (src/experimental/layout/dispatch.py)

  4. Artifact storage (Yandex Cloud S3)
    Extracted content, images, and metadata are uploaded to buckets.

  5. Maintenance
    Sync, dedup, and public link verification keep the dataset clean and accessible.

In short: Yandex.Disk → DB → extraction → S3, with maintenance tools keeping everything aligned.

Architecture Boundaries

Modules

  • src/core/: runtime primitives (config, db sessions, paths, encryption, worker state)
  • src/integrations/: adapters for external systems (Gemini, S3, Yandex Disk)
  • src/content/: content extraction and postprocessing pipelines
  • src/metadata/: metadata extraction and applicability evaluation
  • src/dataset/: dataset assembly pipelines
  • src/experimental/layout/: layout-specific experimental processing
  • src/sync/: synchronization workflows and helpers
  • src/maintenance/: operational and maintenance workflows
  • src/prompts/: content/metadata prompt templates and helpers
  • src/cli/: command registration and CLI argument mapping

Import Boundaries

  • Domain modules should import shared runtime behavior from core/*.
  • External APIs should be consumed via integrations/*.
  • Deprecated modules:
    • meta_fields -> use metadata.fields
    • meta package -> use metadata

Enforcement

make lint runs:

  1. Ruff lint checks
  2. scripts/check_architecture.py boundary checks
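The boundary rule can be sketched as a small AST walk. This is illustrative only — the real scripts/check_architecture.py may work differently, and the external-SDK module names below are assumptions, not the project's actual list:

```python
# Illustrative boundary check: flag direct imports of external SDKs that
# should instead be consumed via integrations/* (module names are examples).
import ast

EXTERNAL_SDKS = {"boto3", "google"}  # assumed external top-level packages

def direct_external_imports(source: str) -> list[str]:
    """Return names of external SDK modules imported directly in `source`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            hits += [a.name for a in node.names
                     if a.name.split(".")[0] in EXTERNAL_SDKS]
        elif isinstance(node, ast.ImportFrom) and node.module:
            if node.module.split(".")[0] in EXTERNAL_SDKS:
                hits.append(node.module)
    return hits

print(direct_external_imports("import boto3\nfrom src.core import config"))  # → ['boto3']
```

A real checker would additionally map each file to its module (core, integrations, etc.) and allow external imports only inside integrations/*.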

Quick Start

  1. Create a virtual environment and install dependencies:
python -m venv .venv
.venv/bin/pip install -r requirements.txt

  2. Create a local config at ~/.monocorpus/config.yaml (see template below).

  3. Run CLI commands via the entrypoint:

python src/main.py --help

Configuration

The project expects a local config file in ~/.monocorpus/config.yaml and a few optional credential files (see below). Keep secrets out of the repo.

Minimal template (fill with your own values):

database_url: "postgresql+psycopg2://USER:PASSWORD@HOST:PORT/DBNAME"
encryption_key: "BASE64_URLSAFE_KEY"

proxy: null

yandex:
  disk:
    oauth_token: "YANDEX_DISK_OAUTH_TOKEN"
    hidden: "/path/segment/used/for/sharing_restricted"
    entry_points:
      tt: "/path/to/tatar/entry_point"
      crh: "/path/to/crimean_tatar/entry_point"
    filtered_out: "/path/to/filtered_out"
  cloud:
    aws_access_key_id: "YANDEX_CLOUD_ACCESS_KEY"
    aws_secret_access_key: "YANDEX_CLOUD_SECRET"
    bucket:
      document: "ttdoc"
      content: "ttcontent"
      content_chunks: "ttcontent_chunks"
      image: "ttimg"
      metadata: "ttmeta"
      upstream_metadata: "ttupstream"

gemini_api_keys:
  - "GEMINI_API_KEY_1"
  - "GEMINI_API_KEY_2"

google_api_key:
  free: "GEMINI_API_KEY_FOR_CLI"
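
If encryption_key is a base64 url-safe key of 32 random bytes (an assumption — the README does not specify the scheme), one way to generate a value for it:

```python
# Generate a 32-byte base64 url-safe key (assumed format for encryption_key).
import base64
import os

key = base64.urlsafe_b64encode(os.urandom(32)).decode("ascii")
print(key)  # paste into ~/.monocorpus/config.yaml as encryption_key
```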

Optional local files (kept out of git):

  • _artifacts/credentials/client_secret.json and _artifacts/credentials/personal_token.json for Google APIs
  • any extra tokens required by your workflow

Common Commands

Run all commands via python src/main.py <command>:

  • sync: sync Yandex.Disk and database, handle filtering and deduplication
  • extract: extract content from documents (PDF and non-PDF)
  • meta: extract metadata from documents
  • hf: assemble structured dataset into parquet
  • layouts: run layout detection on PDFs
  • pps: postpostprocess extracted markdown in ~/.monocorpus/1_result and re-upload updated archives
  • dedup: scan extracted documents for near-full duplicates and produce a JSON report
  • match-limited: reconcile limited vs full document variants
  • sharing-restricted: check restricted sharing docs
  • check-pub-links: verify/restore public links
  • dump-state: export database state to CSV and Google Drive/Sheets
  • upload-to-s3: upload missing Crimean Tatar documents

Use --help for command options:

python src/main.py extract --help

Command Details & Examples

Below are the most commonly used commands and typical flows.

sync
Synchronizes Yandex.Disk with the database, applies filtering rules, and updates links.

python src/main.py sync

extract
Extracts content from documents. Use --md5 or --path to scope work.
--workers controls Gemini parallelism; --batch-size controls queue size.

python src/main.py extract --workers 4
python src/main.py extract --md5 <MD5>
python src/main.py extract --path "/path/in/yadisk"
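
A rough sketch of how --workers might fan extraction out over a pool (hypothetical names; the actual batching and Gemini calls are not shown in this README):

```python
# Hypothetical worker-pool sketch: process documents with bounded parallelism.
from concurrent.futures import ThreadPoolExecutor

def extract_one(md5: str) -> str:
    """Stand-in for extracting a single document identified by its MD5."""
    return f"extracted:{md5}"

def extract_batch(md5s, workers=4):
    """Map extract_one over md5s using at most `workers` threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_one, md5s))

print(extract_batch(["a1b2", "c3d4"], workers=2))  # → ['extracted:a1b2', 'extracted:c3d4']
```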

meta
Extracts structured metadata from documents.

python src/main.py meta

hf
Builds a parquet dataset from extracted content.

python src/main.py hf

layouts
Runs PDF layout detection (YOLO + Surya) and produces annotated outputs.

python src/main.py layouts --md5 <MD5>

check-pub-links
Verifies public links and restores missing ones.

python src/main.py check-pub-links

pps
Runs postpostprocessing on extracted markdown archives in ~/.monocorpus/1_result and re-uploads updated archives to S3.

python src/main.py pps

dedup
Scans extracted archives for near-full duplicate documents and writes a report recommending one keeper per duplicate group, chosen by format priority (epub > fb2 > docx > pdf).

python src/main.py dedup --threshold 0.98
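
The format-priority rule for picking a keeper can be illustrated as follows (a hypothetical helper, not the project's actual code):

```python
# Pick the "keeper" among duplicates by format priority: epub > fb2 > docx > pdf.
FORMAT_PRIORITY = {"epub": 0, "fb2": 1, "docx": 2, "pdf": 3}

def pick_keeper(paths):
    """Return the duplicate whose file extension ranks highest; unknown formats rank last."""
    def rank(path):
        ext = path.rsplit(".", 1)[-1].lower()
        return FORMAT_PRIORITY.get(ext, len(FORMAT_PRIORITY))
    return min(paths, key=rank)

print(pick_keeper(["book.pdf", "book.epub", "book.docx"]))  # → book.epub
```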

dump-state
Exports DB state into CSV, ZIP, and Google Sheets/Drive.

python src/main.py dump-state
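
A minimal sketch of the CSV half of the dump (the column names here are assumptions drawn from the README's description of document records, not the actual schema):

```python
# Write document-state rows to CSV (columns are illustrative assumptions).
import csv
import io

rows = [
    {"md5": "d41d8cd98f00b204e9800998ecf8427e", "language": "tt", "state": "extracted"},
    {"md5": "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", "language": "crh", "state": "pending"},
]
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["md5", "language", "state"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```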

Workdir Layout

The local workdir is ~/.monocorpus and is organized into subfolders like:

  • 0_entry_point: local copies of documents
  • 1_result: extracted content
  • 2_metadata: extracted metadata
  • misc/: supporting artifacts (slices, upstream metadata, logs, etc.)

See src/dirs.py for the full list of subdirectories.

Example Layout

~/.monocorpus/
  0_entry_point/
    <md5>.pdf
    <md5>.docx
  1_result/
    <md5>-formatted.md
    <md5>.zip
  2_metadata/
    <md5>.json
  misc/
    doc_slices/
    upstream_metadata/
    page_images/
    clips/
    prompts/
    logs/
  parquet/
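
Files under the workdir are keyed by the document's MD5. A streaming helper like the following produces those names (illustrative; see the project's own hashing code for the real behavior):

```python
# Compute a file's hex MD5 in chunks, matching the <md5>.<ext> naming scheme.
import hashlib

def file_md5(path, chunk_size=1 << 20):
    """Stream `path` in 1 MiB chunks and return its hex MD5 digest."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Chunked reading keeps memory flat even for large PDFs.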

Security Notes

  • Keep secrets in ~/.monocorpus/config.yaml or other local files, not in the repo.
  • This repository's .gitignore already ignores common secret files, but ensure sensitive files are not committed.

Development Notes

  • Main CLI entrypoint: src/main.py
  • Core utilities: src/utils.py
  • Content pipeline: src/content/*
  • Metadata pipeline: src/metadata/*
  • Sync/maintenance: src/sync.py, src/check_pub_links.py, src/match_limited.py

Tests

Run unit tests:

.venv/bin/python -m unittest discover -s tests -v

Run lint checks:

make lint
make lint-fix

About

The Monocorpus project is a collection of tools designed to facilitate the development of a Tatar language monocorpus.
