feat: return sourceUrl for postgres docs #4

Merged
MasterOdin merged 2 commits into main from mpeveler/feat-postgres-source-rul on Sep 11, 2025

MasterOdin commented Sep 10, 2025

This PR returns the sourceUrl that a documentation chunk came from. I manually created a source_url column on the docs.postgres table and populated it with this data.

The data is mostly derived from the header_path: for any header_path where length == 1, we use the first entry, and for any header_path where length > 1, we use the second entry.

Given the version of the chunk, we start with the base url https://www.postgresql.org/docs/${version}/. Given the entry from above, if it matches the regex /\{#([^}]+)\}/, we take the captured string as the page name. If it doesn't match, we take the entry and do entry.toLowerCase().replaceAll(' ', '-'). In either case, we then append the result to our base url with the .html suffix, and that gives us the url of the documentation page that the chunk came from.
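The rule above can be sketched as a few small functions. This is only an illustration: the names `pickHeaderEntry`, `urlPartFromHeader`, and `buildSourceUrl` are hypothetical and not part of the PR, the second sample header is made up, and `replace(/ /g, '-')` is used as an equivalent of `replaceAll(' ', '-')` so the snippet compiles on older targets.

```typescript
// Hypothetical helper names; the PR describes this rule in prose only.
const ANCHOR_RE = /\{#([^}]+)\}/;

// Pick the header entry the URL is derived from: the only entry when the
// path has one element, otherwise the second entry.
function pickHeaderEntry(headerPath: string[]): string {
  return headerPath.length > 1 ? headerPath[1] : headerPath[0];
}

// Turn a header entry into the page name (without the ".html" suffix).
function urlPartFromHeader(entry: string): string {
  const match = entry.match(ANCHOR_RE);
  // Prefer the explicit {#anchor}; otherwise slugify the header text.
  return match ? match[1] : entry.toLowerCase().replace(/ /g, '-');
}

function buildSourceUrl(version: number, headerPath: string[]): string {
  const urlPart = urlPartFromHeader(pickHeaderEntry(headerPath));
  return `https://www.postgresql.org/docs/${version}/${urlPart}.html`;
}

// Header with an explicit anchor (taken from the script's query).
console.log(buildSourceUrl(17, ['ECPG --- Embedded SQL in C {#ecpg}']));
// Made-up headers without an anchor, to show the slug fallback.
console.log(buildSourceUrl(17, ['Some Chapter', 'Some Section Title']));
```

A slug built this way is only a guess at the page name, which is presumably why the special-cased branches of the script verify candidates with a HEAD request.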

For certain pages (e.g. those stemming from "Archive Modules" or "SPI"), we have to look at the content to see whether it matches a known pattern that we can determine the URL from. For example, the content of an "Archive Module" chunk may start with a code-fenced string that we can plug into the URL. One such chunk starts with:

`ALTER CONVERSION` changes the definition of a conversion.

Taking the code bit, lowercasing it, removing the spaces, and prefixing it with sql- gives the actual URL: https://www.postgresql.org/docs/17/sql-alterconversion.html. See below for the full script I've made, which gets a lot (but not all) of the chunk URLs. Right now it only resolves the root page and not the path on the page, but I can adjust it for that later.
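As a minimal sketch of that transformation, assuming the chunk's first line wraps the command name in backticks (the function name `sqlPageCandidate` is hypothetical):

```typescript
// Build the candidate "sql-..." page name from the first backtick-quoted
// span on the chunk's first line. Returns null when there is no such span.
function sqlPageCandidate(content: string): string | null {
  const firstLine = content.split('\n', 1)[0].trim();
  const match = firstLine.match(/`(.+?)`/);
  if (!match) {
    return null;
  }
  // Lowercase the command and drop the spaces, then add the sql- prefix.
  return `sql-${match[1].toLowerCase().replace(/ /g, '')}`;
}

const chunk = '`ALTER CONVERSION` changes the definition of a conversion.';
console.log(sqlPageCandidate(chunk)); // sql-alterconversion
```

The result is only a candidate: the full script confirms that the page exists before storing the URL, and tries other spellings when it does not.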

<details>
<summary>URL Generation Script</summary>

```ts
import pg from 'pg';

// Returns true if the candidate documentation page exists for this version.
const checkUrlPart = async (version: number, urlPart: string): Promise<boolean> => {
  // console.log(`https://www.postgresql.org/docs/${version}/${urlPart}.html`);
  const res = await fetch(`https://www.postgresql.org/docs/${version}/${urlPart}.html`, {
    method: 'HEAD',
  });
  // console.log(res.status);
  return res.status === 200;
};

const pool = new pg.Pool({
  connectionString: 'postgres://xxxx',
});

(async () => {
  // pool.query() checks out and releases clients itself, so no explicit
  // pool.connect() is needed (an unreleased client would leak).
  const result = await pool.query(`SELECT id, header_path, header_depth, version, content FROM docs.postgres WHERE source_url IS NULL AND header_path[1] = 'ECPG --- Embedded SQL in C {#ecpg}' ORDER BY id DESC`);

  const promises: Promise<void>[] = [];

  const failedToFind: number[] = [];

  for (const row of result.rows) {
    const { id, header_path, header_depth, version, content } = row;
    console.log(`${id} - ${header_path.join(', ')} - ${version}`);
    let urlPart: string = '';
    if (header_path[0] === 'Archive Modules') {
      if (header_path.length !== 2 || header_path[1] !== 'Description') {
        // These are usually options or parameter chunks from the page, skipping them
        // for right now.
        continue;
      }
      const firstLine = content.split('\n', 1)[0].trim();
      // Strip the backticks of inline code from the first word.
      const firstWord = firstLine.split(' ', 1)[0].replaceAll('`', '').toLowerCase();
      let searchString = '';
      if (firstWord.startsWith('pg_')) {
        searchString = firstWord;
      } else {
        const match = firstLine.match(/`(.+?)`/);
        if (match) {
          searchString = match[1];
        } else {
          searchString = firstLine.split(' ', 1)[0];
        }
        searchString = searchString.toLowerCase();
      }
      if (
        !searchString.includes(' ') && (
          searchString === 'ecpg' ||
          searchString === 'psql' ||
          searchString === 'postgres' ||
          searchString.startsWith('pg') ||
          searchString.endsWith('db') ||
          searchString.endsWith('user')
        )
      ) {
        // Applications: try app-<name> and <name>, with underscores removed or dashed.
        if (await checkUrlPart(version, `app-${searchString.replaceAll('_', '')}`)) {
          urlPart = `app-${searchString.replaceAll('_', '')}`;
        } else if (await checkUrlPart(version, `app-${searchString.replaceAll('_', '-')}`)) {
          urlPart = `app-${searchString.replaceAll('_', '-')}`;
        } else if (await checkUrlPart(version, `${searchString.replaceAll('_', '')}`)) {
          urlPart = `${searchString.replaceAll('_', '')}`;
        } else if (await checkUrlPart(version, `${searchString.replaceAll('_', '-')}`)) {
          urlPart = `${searchString.replaceAll('_', '-')}`;
        } else {
          failedToFind.push(id);
          continue;
        }
      } else {
        // SQL commands: try sql-<command>, with spaces removed or dashed.
        if (await checkUrlPart(version, `sql-${searchString.replaceAll(' ', '')}`)) {
          urlPart = `sql-${searchString.replaceAll(' ', '')}`;
        } else if (await checkUrlPart(version, `sql-${searchString.replaceAll(' ', '-')}`)) {
          urlPart = `sql-${searchString.replaceAll(' ', '-')}`;
        } else if (searchString.includes('operator') && await checkUrlPart(version, `sql-${searchString.replace('operator', 'op').replaceAll(' ', '')}`)) {
          urlPart = `sql-${searchString.replace('operator', 'op').replaceAll(' ', '')}`;
        } else if (searchString.includes('text search')) {
          searchString = searchString.replace('text search', 'ts');
          if (searchString.includes('configuration')) {
            searchString = searchString.replace('configuration', 'config');
          }
          if (await checkUrlPart(version, `sql-${searchString.replaceAll(' ', '')}`)) {
            urlPart = `sql-${searchString.replaceAll(' ', '')}`;
          } else {
            failedToFind.push(id);
            continue;
          }
        } else {
          failedToFind.push(id);
          continue;
        }
      }
    } else if (header_path[0] === 'Server Programming Interface {#spi}' && header_path.length > 1) {
      if (header_path.length !== 2) {
        // Don't think there are any chunks like this, but just in case
        continue;
      } else if (header_path[1].includes('#')) {
        const match = header_path[1].match(/\{#([^}]+)\}/);
        if (!match) {
          // Unknown format, skip it
          continue;
        }
        urlPart = match[1];
      } else if (header_path[1] === 'Description') {
        const match = content.match(/`(.+?)`/);
        if (!match) {
          continue;
        }
        const funcName = match[1].toLowerCase();
        if (funcName === 'spi_repalloc') {
          urlPart = 'spi-realloc';
        } else if (await checkUrlPart(version, `spi-${funcName.replaceAll('_', '-')}`)) {
          urlPart = `spi-${funcName.replaceAll('_', '-')}`;
        } else if (await checkUrlPart(version, `spi-${funcName.replaceAll('_', '')}`)) {
          urlPart = `spi-${funcName.replaceAll('_', '')}`;
        } else if (funcName.includes('tup') && !funcName.includes('tuple') && await checkUrlPart(version, `spi-${funcName.replace('tup', 'tuple').replaceAll('_', '-')}`)) {
          urlPart = `spi-${funcName.replace('tup', 'tuple').replaceAll('_', '-')}`;
        } else {
          failedToFind.push(id);
          continue;
        }
      } else {
        // Chunks from the description page, skipping for now
        continue;
      }
    } else if (header_path[0] === 'ECPG --- Embedded SQL in C {#ecpg}' && header_path.length > 1) {
      if (header_path[1].includes('#')) {
        const match = header_path[1].match(/\{#([^}]+)\}/);
        if (!match) {
          // Unknown format, skip it
          continue;
        }
        urlPart = match[1];
      } else {
        if (header_path[1] !== 'Description') {
          // Chunks from the description page, skipping for now
          continue;
        }
        const match = content.match(/`(.+?)`/);
        // The WHENEVER page is the only one that doesn't have the keywords in backticks
        const stringMatch = match ? match[1].toLowerCase() : 'whenever';
        if (await checkUrlPart(version, `ecpg-sql-${stringMatch.replaceAll(' ', '-')}`)) {
          urlPart = `ecpg-sql-${stringMatch.replaceAll(' ', '-')}`;
        } else {
          failedToFind.push(id);
          continue;
        }
      }
    } else {
      // Default rule: use the {#anchor} if present, otherwise slugify the header.
      const header = header_depth > 1 ? header_path[1] : header_path[0];
      const match = header.match(/\{#([^}]+)\}/);
      urlPart = match ? match[1] : header.toLowerCase().replaceAll(' ', '-');
    }

    const url = `https://www.postgresql.org/docs/${version}/${urlPart}.html`;
    console.log('  -> ', url);
    promises.push(
      pool.query(
        'UPDATE docs.postgres SET source_url = $1 WHERE id = $2', [url, id]
      ).then(() => {
        // Noop
      }).catch((err) => {
        console.error(err);
      }),
    );
  }

  await Promise.allSettled(promises);

  if (failedToFind.length > 0) {
    console.log(`Failed to find URL parts for IDs (${failedToFind.length}):`);
    console.log(JSON.stringify(failedToFind));
  }
})()
  .then(() => {
    console.log('Done');
  })
  .catch((err) => {
    console.error('Error:', err);
    process.exitCode = 1;
  })
  // Close the pool last; calling process.exit() earlier would skip this step.
  .finally(() => pool.end());
```

</details>
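The script repeats the same "try a candidate, fall back to the next" chain in several branches. One way to condense it, shown only as a sketch, is a helper that returns the first candidate that passes a checker. The names `firstExistingPage`, `appCandidates`, and `PageChecker` are hypothetical, the checker is injected so the example runs without network access, and the candidate order assumes the application branch strips or dashes underscores.

```typescript
// A checker with the same shape as the script's checkUrlPart.
type PageChecker = (version: number, urlPart: string) => Promise<boolean>;

// Return the first candidate page name the checker accepts, or null.
async function firstExistingPage(
  version: number,
  candidates: string[],
  check: PageChecker,
): Promise<string | null> {
  for (const candidate of candidates) {
    if (await check(version, candidate)) {
      return candidate;
    }
  }
  return null;
}

// Candidate page names for a client application, in the order assumed above.
function appCandidates(name: string): string[] {
  const joined = name.replace(/_/g, '');
  const dashed = name.replace(/_/g, '-');
  return [`app-${joined}`, `app-${dashed}`, joined, dashed];
}

// Stub checker standing in for the real HEAD request.
const stubCheck: PageChecker = async (_version, urlPart) => urlPart === 'app-pg-ctl';

firstExistingPage(17, appCandidates('pg_ctl'), stubCheck).then((urlPart) => {
  console.log(urlPart ?? 'not found');
});
```

With a helper like this, each branch would only need to list its candidates and push the id to failedToFind when the result is null.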

MasterOdin and others added 2 commits September 9, 2025 23:18
Signed-off-by: Matthew Peveler <matt.peveler@gmail.com>
MasterOdin merged commit 1e54c34 into main Sep 11, 2025
3 checks passed
MasterOdin deleted the mpeveler/feat-postgres-source-rul branch September 11, 2025 16:09