feat: return sourceUrl for postgres docs #4
Merged
MasterOdin merged 2 commits into main, Sep 11, 2025
Conversation
Signed-off-by: Matthew Peveler <matt.peveler@gmail.com>
MasterOdin added a commit that referenced this pull request on Sep 23, 2025.
This PR adds returning the `sourceUrl` that a documentation chunk came from. I manually created a `source_url` column on the `docs.postgres` table that holds the data. The data is mostly derived from the `header_path`: for any `header_path` of length == 1, we use the first entry, and for any `header_path` of length > 1, we use the second entry.

Given the version of the chunk, we start with the base URL `https://www.postgresql.org/docs/${version}/`. If the entry from above matches the regex `/\{#([^}]+)\}/`, we take the captured string and append it to the base URL. If it doesn't match, we take the entry and do `entry.toLowerCase().replaceAll(' ', '-')`. We then append the transformed entry to the base URL with the `.html` suffix, and that gives us the URL of the documentation page the chunk was taken from.

For certain pages (e.g. those stemming from "Archive Modules" or "SPI"), we have to look at the content to see if it matches a specific setup we can determine the URL from. For example, an "Archive Module" chunk's content may start with a code-fenced string that we can plug into the URL: taking the code bit, lowercasing it, removing the spaces, and prefixing it with `sql-` gets the actual URL, e.g. https://www.postgresql.org/docs/17/sql-alterconversion.html.

See below for the full script I've made that gets a lot (but not all) of the chunk URLs. Right now it only does the root page and not the path on the page, but I can adjust it for that later.

URL Generation Script
```ts
import pg from 'pg';

const checkUrlPart = async (version: number, urlPart: string): Promise<boolean> => {
  // console.log(`https://www.postgresql.org/docs/${version}/${urlPart}.html`);
  const res = await fetch(`https://www.postgresql.org/docs/${version}/${urlPart}.html`, {
    method: 'HEAD',
  });
  // console.log(res.status);
  return res.status === 200;
};

const pool = new pg.Pool({
  connectionString: 'postgres://xxxx',
});

(async () => {
  await pool.connect();
  const result = await pool.query(
    `SELECT id, header_path, header_depth, version, content FROM docs.postgres WHERE source_url IS NULL AND header_path[1] = 'ECPG --- Embedded SQL in C {#ecpg}' ORDER BY id DESC`,
  );
  const promises: Promise<void>[] = [];
  const failedToFind: number[] = [];
  for (const row of result.rows) {
    const { id, header_path, header_depth, version, content } = row;
    console.log(`${id} - ${header_path.join(', ')} - ${version}`);
    let urlPart: string = '';
    if (header_path[0] === 'Archive Modules') {
      if (header_path.length !== 2 || header_path[1] !== 'Description') {
        // These are usually options or parameter chunks from the page, skipping them
        // for right now.
        continue;
      }
      const firstLine = content.split('\n', 1)[0].trim();
      const firstWord = firstLine.split(' ', 1)[0].replaceAll('`', '').toLowerCase();
      let searchString = '';
      if (firstWord.startsWith('pg_')) {
        searchString = firstWord;
      } else {
        const match = firstLine.match(/`(.+?)`/);
        if (match) {
          searchString = match[1];
        } else {
          searchString = firstLine.split(' ', 1)[0];
        }
        searchString = searchString.toLowerCase();
      }
      if (
        !searchString.includes(' ')
        && (
          searchString === 'ecpg'
          || searchString === 'psql'
          || searchString === 'postgres'
          || searchString.startsWith('pg')
          || searchString.endsWith('db')
          || searchString.endsWith('user')
        )
      ) {
        if (await checkUrlPart(version, `app-${searchString.replaceAll('_', '')}`)) {
          urlPart = `app-${searchString.replaceAll('_', '')}`;
        } else if (await checkUrlPart(version, `app-${searchString.replaceAll('_', '-')}`)) {
          urlPart = `app-${searchString.replaceAll('_', '-')}`;
        } else if (await checkUrlPart(version, `${searchString.replaceAll('_', '')}`)) {
          urlPart = `${searchString.replaceAll('_', '')}`;
        } else if (await checkUrlPart(version, `${searchString.replaceAll('_', '-')}`)) {
          urlPart = `${searchString.replaceAll('_', '-')}`;
        } else {
          failedToFind.push(id);
          continue;
        }
      } else {
        if (await checkUrlPart(version, `sql-${searchString.replaceAll(' ', '')}`)) {
          urlPart = `sql-${searchString.replaceAll(' ', '')}`;
        } else if (await checkUrlPart(version, `sql-${searchString.replaceAll(' ', '-')}`)) {
          urlPart = `sql-${searchString.replaceAll(' ', '-')}`;
        } else if (searchString.includes('operator') && await checkUrlPart(version, `sql-${searchString.replace('operator', 'op').replaceAll(' ', '')}`)) {
          urlPart = `sql-${searchString.replace('operator', 'op').replaceAll(' ', '')}`;
        } else if (searchString.includes('text search')) {
          searchString = searchString.replace('text search', 'ts');
          if (searchString.includes('configuration')) {
            searchString = searchString.replace('configuration', 'config');
          }
          if (await checkUrlPart(version, `sql-${searchString.replaceAll(' ', '')}`)) {
            urlPart = `sql-${searchString.replaceAll(' ', '')}`;
          } else {
            failedToFind.push(id);
            continue;
          }
        } else {
          failedToFind.push(id);
          continue;
        }
      }
    } else if (header_path[0] === 'Server Programming Interface {#spi}' && header_path.length > 1) {
      if (header_path.length !== 2) {
        // Don't think there's any chunks like this, but just in case
        continue;
      } else if (header_path[1].includes('#')) {
        const match = header_path[1].match(/\{#([^}]+)\}/);
        if (!match) {
          // Unknown format, skip it
          continue;
        }
        urlPart = match[1];
      } else if (header_path[1] === 'Description') {
        const match = content.match(/`(.+?)`/);
        if (!match) {
          continue;
        }
        const funcName = match[1].toLowerCase();
        if (funcName === 'spi_repalloc') {
          urlPart = 'spi-realloc';
        } else if (await checkUrlPart(version, `spi-${funcName.replaceAll('_', '-')}`)) {
          urlPart = `spi-${funcName.replaceAll('_', '-')}`;
        } else if (await checkUrlPart(version, `spi-${funcName.replaceAll('_', '')}`)) {
          urlPart = `spi-${funcName.replaceAll('_', '')}`;
        } else if (funcName.includes('tup') && !funcName.includes('tuple') && await checkUrlPart(version, `spi-${funcName.replace('tup', 'tuple').replaceAll('_', '-')}`)) {
          urlPart = `spi-${funcName.replace('tup', 'tuple').replaceAll('_', '-')}`;
        } else {
          failedToFind.push(id);
          continue;
        }
      } else {
        // Chunks from the description page, skipping for now
        continue;
      }
    } else if (header_path[0] === 'ECPG --- Embedded SQL in C {#ecpg}' && header_path.length > 1) {
      if (header_path[1].includes('#')) {
        const match = header_path[1].match(/\{#([^}]+)\}/);
        if (!match) {
          // Unknown format, skip it
          continue;
        }
        urlPart = match[1];
      } else {
        if (header_path[1] !== 'Description') {
          // Chunks from the description page, skipping for now
          continue;
        }
        const match = content.match(/`(.+?)`/);
        // The WHENEVER page is the only one that doesn't have the keywords in backticks
        const stringMatch = match ? match[1].toLowerCase() : 'whenever';
        if (await checkUrlPart(version, `ecpg-sql-${stringMatch.replaceAll(' ', '-')}`)) {
          urlPart = `ecpg-sql-${stringMatch.replaceAll(' ', '-')}`;
        } else {
          failedToFind.push(id);
          continue;
        }
      }
    } else {
      const header = header_depth > 1 ? header_path[1] : header_path[0];
      const match = header.match(/\{#([^}]+)\}/);
      urlPart = match ? match[1] : header.toLowerCase().replaceAll(' ', '-');
    }
  }
  await Promise.allSettled(promises);
  if (failedToFind.length > 0) {
    console.log(`Failed to find URL parts for IDs (${failedToFind.length}):`);
    console.log(JSON.stringify(failedToFind));
  }
})()
  .then(() => {
    console.log('Done');
    process.exit(0);
  })
  .catch((err) => {
    console.error('Error:', err);
    process.exit(1);
  })
  .finally(() => {
    pool.end();
  });
```