
feat: add postgres ingest #6

Merged
MasterOdin merged 5 commits into main from mpeveler/feat-postgres-ingest
Sep 17, 2025

Conversation

@MasterOdin (Contributor) commented Sep 16, 2025

This PR adds an ingest/postgres_docs.py script that:

  1. Clones the https://github.com/postgres/postgres repo locally if it doesn't exist
  2. For each version in (14, 15, 16, 17):
    1. Runs ./configure for that version
    2. Runs make html to build the HTML pages of the docs
    3. Converts each generated HTML file to markdown using markdownify (saved to ingest/build/md)
    4. Creates the docs.postgres_pages_tmp and docs.postgres_chunks_tmp tables
    5. For each markdown file:
      1. Inserts an entry into docs.postgres_pages_tmp
      2. Chunks it on headers, splitting any oversized chunk into 7000-token sub-chunks
      3. Inserts each chunk into `docs.postgres_chunks_tmp`
    6. Renames docs.postgres_pages_tmp to docs.postgres_pages and docs.postgres_chunks_tmp to docs.postgres_chunks
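The header-based chunking in step 5.2 can be sketched roughly as below. This is an illustrative sketch, not the script's actual code: the real script counts tokens with tiktoken, while word count stands in here so the example stays dependency-free, and `chunk_markdown` is my name for the helper, not the script's.

```python
import re

MAX_TOKENS = 7000  # limit from the PR description


def chunk_markdown(md: str, max_tokens: int = MAX_TOKENS):
    """Split markdown on headers; sub-split chunks exceeding max_tokens.

    Returns (chunk_index, sub_chunk_index, text) tuples, mirroring the
    chunk_index/sub_chunk_index columns in docs.postgres_chunks.
    """
    # Split before lines starting with '#' (ATX headers), so each header
    # stays attached to the section body that follows it.
    sections = re.split(r"(?m)^(?=#{1,6} )", md)
    out = []
    for ci, sec in enumerate(s for s in sections if s.strip()):
        words = sec.split()  # word count as a stand-in for tiktoken tokens
        if len(words) <= max_tokens:
            out.append((ci, 0, sec))
        else:
            # Oversized section: emit fixed-size sub-chunks.
            for si in range(0, len(words), max_tokens):
                out.append((ci, si // max_tokens, " ".join(words[si:si + max_tokens])))
    return out
```

A chunk that fits its section's budget keeps sub_chunk_index 0; only oversized sections produce multiple sub-chunks.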

The structure of the two tables:

CREATE TABLE docs.postgres_pages (
  id int4 PRIMARY KEY generated by default as identity
  , version int2 NOT NULL
  , url TEXT UNIQUE NOT NULL
  , domain TEXT NOT NULL
  , filename TEXT NOT NULL
  , content_length INTEGER
  , scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
  , chunking_method TEXT DEFAULT 'header'
  , chunks_count INTEGER DEFAULT 0
);

CREATE TABLE IF NOT EXISTS docs.postgres_chunks (
  id int4 PRIMARY KEY generated by default as identity
  , page_id INTEGER REFERENCES docs.postgres_pages(id) ON DELETE CASCADE
  , chunk_index INTEGER NOT NULL
  , sub_chunk_index INTEGER NOT NULL DEFAULT 0
  , content TEXT NOT NULL
  , metadata JSONB
  , embedding vector(1536)
  , created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
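Step 6's tmp-to-live promotion could look something like the following. This is a hedged sketch of one reasonable approach, not necessarily the script's exact statements; doing the drops and renames in a single transaction means readers never observe a half-swapped state.

```sql
-- Promote the freshly populated _tmp tables (illustrative only).
BEGIN;
DROP TABLE IF EXISTS docs.postgres_chunks;  -- drop child first (FK on page_id)
DROP TABLE IF EXISTS docs.postgres_pages;
ALTER TABLE docs.postgres_pages_tmp RENAME TO postgres_pages;
ALTER TABLE docs.postgres_chunks_tmp RENAME TO postgres_chunks;
COMMIT;
```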

Creation of the base tables will be handled by #3 once that's ready to be merged; for now I'm focusing on getting the ingest pipelines done, and will return to that afterwards.

I had originally started with the approach taken in https://github.com/timescale/pg-rag, however it had the following issues:

  1. It wasn't actually building the docs for 14 and 15 (those versions lack a Makefile target for postgres-full.xml)
  2. The postgres-full.xml -> postgres-full.md conversion produced bad headers for the ref entries, causing them to be chunked poorly.
  3. Getting the proper source_url for each chunk was an annoyingly fraught process (see feat: return sourceUrl for postgres docs #4 for the script I made, which still left about a third of entries without a url), whereas with individual HTML pages it was trivial.
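To illustrate the last point, a hypothetical helper (the name `source_url` is mine, not the script's) showing why per-page processing makes URLs easy: each built HTML file maps one-to-one onto its published page on postgresql.org.

```python
from pathlib import Path


def source_url(html_path: str, version: int) -> str:
    """Map a built HTML file to its published postgresql.org URL.

    Hypothetical sketch: the filename of each built page matches the
    published page name, so no cross-referencing is needed.
    """
    return f"https://www.postgresql.org/docs/{version}/{Path(html_path).name}"
```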

As part of this PR, I've intentionally omitted pyproject.toml, README.md, etc.; I plan to add those later. If you wish to run this locally, you will need to edit the root .env file to set the PG... variables for connecting to the database. The dependencies are:

dependencies = [
    "beautifulsoup4>=4.13.5",
    "markdownify>=1.1.0",
    "openai>=1.97.1",
    "psycopg[binary,pool]>=3.2.9",
    "python-dotenv[cli]>=1.1.1",
    "tiktoken>=0.11.0",
]

Signed-off-by: Matthew Peveler <mpeveler@tigerdata.com>
@MasterOdin MasterOdin requested a review from jgpruitt September 17, 2025 00:01
Contributor

Can we have a requirements.txt or a pyproject.toml here to make it easy to pull dependencies and run this, please?

Contributor

This is intended to be run on the host and not in a docker container, right? Can we have a readme with prereqs? I used to run this process in docker b/c I didn't want all that stuff installed on my mac.

Contributor Author

Could easily be either. I plan to add a README in a follow-up PR that covers both this and #7. I can also look to add a Dockerfile to run these as well.

@MasterOdin MasterOdin merged commit 84a6d79 into main Sep 17, 2025
3 of 4 checks passed
@MasterOdin MasterOdin deleted the mpeveler/feat-postgres-ingest branch September 17, 2025 20:53
MasterOdin added a commit that referenced this pull request Sep 23, 2025
