feat: return sourceUrl for postgres docs #4
Merged
MasterOdin merged 2 commits into main, Sep 11, 2025
Conversation
Signed-off-by: Matthew Peveler <matt.peveler@gmail.com>
MasterOdin added a commit that referenced this pull request on Sep 23, 2025.
This PR adds returning the `sourceUrl` that a documentation chunk came from. I manually created a `source_url` column on the `docs.postgres` table that holds the data. The data is mostly derived from the `header_path`: for any `header_path` of length == 1, we use the first entry, and for any `header_path` of length > 1, we use the second entry.

Given the version of the chunk, we start with the base URL `https://www.postgresql.org/docs/${version}/`. If the entry from above matches the regex `/\{#([^}]+)\}/`, we take the captured string and append it to the base URL. If it doesn't match, we take the entry and do `entry.toLowerCase().replaceAll(' ', '-')`. We then append the transformed entry to the base URL with the `.html` suffix, and that gives us the URL of the documentation page the chunk was taken from.

For certain pages (e.g. those stemming from "Archive Modules" or "SPI"), we have to look at the content to see if it matches a specific setup we can determine the URL from. For example, an "Archive Module" chunk's content may start with a code-fenced string that we can plug into the URL: taking the code bit, lowercasing it, removing the spaces, and prefixing it with `sql-` gets the actual URL, e.g. https://www.postgresql.org/docs/17/sql-alterconversion.html.

See below for the full script I've made that gets a lot (but not all) of the chunk URLs. Right now it only does the root page and not the path on the page, but I can adjust it for that later.

URL Generation Script
```ts
import pg from 'pg';

const checkUrlPart = async (version: number, urlPart: string): Promise<boolean> => {
  // console.log(`https://www.postgresql.org/docs/${version}/${urlPart}.html`);
  const res = await fetch(`https://www.postgresql.org/docs/${version}/${urlPart}.html`, {
    method: 'HEAD',
  });
  // console.log(res.status);
  return res.status === 200;
};

const pool = new pg.Pool({
  connectionString: 'postgres://xxxx',
});

(async () => {
  await pool.connect();
  const result = await pool.query(
    `SELECT id, header_path, header_depth, version, content FROM docs.postgres WHERE source_url IS NULL AND header_path[1] = 'ECPG --- Embedded SQL in C {#ecpg}' ORDER BY id DESC`,
  );
  const promises: Promise<void>[] = [];
  const failedToFind: number[] = [];
  for (const row of result.rows) {
    const { id, header_path, header_depth, version, content } = row;
    console.log(`${id} - ${header_path.join(', ')} - ${version}`);
    let urlPart: string = '';
    if (header_path[0] === 'Archive Modules') {
      if (header_path.length !== 2 || header_path[1] !== 'Description') {
        // These are usually options or parameter chunks from the page, skipping them
        // for right now.
        continue;
      }
      const firstLine = content.split('\n', 1)[0].trim();
      const firstWord = firstLine.split(' ', 1)[0].replaceAll('`', '').toLowerCase();
      let searchString = '';
      if (firstWord.startsWith('pg_')) {
        searchString = firstWord;
      } else {
        const match = firstLine.match(/`(.+?)`/);
        if (match) {
          searchString = match[1];
        } else {
          searchString = firstLine.split(' ', 1)[0];
        }
        searchString = searchString.toLowerCase();
      }
      if (
        !searchString.includes(' ')
        && (
          searchString === 'ecpg'
          || searchString === 'psql'
          || searchString === 'postgres'
          || searchString.startsWith('pg')
          || searchString.endsWith('db')
          || searchString.endsWith('user')
        )
      ) {
        if (await checkUrlPart(version, `app-${searchString.replaceAll('_', '')}`)) {
          urlPart = `app-${searchString.replaceAll('_', '')}`;
        } else if (await checkUrlPart(version, `app-${searchString.replaceAll('_', '-')}`)) {
          urlPart = `app-${searchString.replaceAll('_', '-')}`;
        } else if (await checkUrlPart(version, `${searchString.replaceAll('_', '')}`)) {
          urlPart = `${searchString.replaceAll('_', '')}`;
        } else if (await checkUrlPart(version, `${searchString.replaceAll('_', '-')}`)) {
          urlPart = `${searchString.replaceAll('_', '-')}`;
        } else {
          failedToFind.push(id);
          continue;
        }
      } else {
        if (await checkUrlPart(version, `sql-${searchString.replaceAll(' ', '')}`)) {
          urlPart = `sql-${searchString.replaceAll(' ', '')}`;
        } else if (await checkUrlPart(version, `sql-${searchString.replaceAll(' ', '-')}`)) {
          urlPart = `sql-${searchString.replaceAll(' ', '-')}`;
        } else if (searchString.includes('operator') && await checkUrlPart(version, `sql-${searchString.replace('operator', 'op').replaceAll(' ', '')}`)) {
          urlPart = `sql-${searchString.replace('operator', 'op').replaceAll(' ', '')}`;
        } else if (searchString.includes('text search')) {
          searchString = searchString.replace('text search', 'ts');
          if (searchString.includes('configuration')) {
            searchString = searchString.replace('configuration', 'config');
          }
          if (await checkUrlPart(version, `sql-${searchString.replaceAll(' ', '')}`)) {
            urlPart = `sql-${searchString.replaceAll(' ', '')}`;
          } else {
            failedToFind.push(id);
            continue;
          }
        } else {
          failedToFind.push(id);
          continue;
        }
      }
    } else if (header_path[0] === 'Server Programming Interface {#spi}' && header_path.length > 1) {
      if (header_path.length !== 2) {
        // Don't think there's any chunks like this, but just in case
        continue;
      } else if (header_path[1].includes('#')) {
        const match = header_path[1].match(/\{#([^}]+)\}/);
        if (!match) {
          // Unknown format, skip it
          continue;
        }
        urlPart = match[1];
      } else if (header_path[1] === 'Description') {
        const match = content.match(/`(.+?)`/);
        if (!match) {
          continue;
        }
        const funcName = match[1].toLowerCase();
        if (funcName === 'spi_repalloc') {
          urlPart = 'spi-realloc';
        } else if (await checkUrlPart(version, `spi-${funcName.replaceAll('_', '-')}`)) {
          urlPart = `spi-${funcName.replaceAll('_', '-')}`;
        } else if (await checkUrlPart(version, `spi-${funcName.replaceAll('_', '')}`)) {
          urlPart = `spi-${funcName.replaceAll('_', '')}`;
        } else if (funcName.includes('tup') && !funcName.includes('tuple') && await checkUrlPart(version, `spi-${funcName.replace('tup', 'tuple').replaceAll('_', '-')}`)) {
          urlPart = `spi-${funcName.replace('tup', 'tuple').replaceAll('_', '-')}`;
        } else {
          failedToFind.push(id);
          continue;
        }
      } else {
        // Chunks from the description page, skipping for now
        continue;
      }
    } else if (header_path[0] === 'ECPG --- Embedded SQL in C {#ecpg}' && header_path.length > 1) {
      if (header_path[1].includes('#')) {
        const match = header_path[1].match(/\{#([^}]+)\}/);
        if (!match) {
          // Unknown format, skip it
          continue;
        }
        urlPart = match[1];
      } else {
        if (header_path[1] !== 'Description') {
          // Chunks from the description page, skipping for now
          continue;
        }
        const match = content.match(/`(.+?)`/);
        // The WHENEVER page is the only one that doesn't have the keywords in backticks
        const stringMatch = match ? match[1].toLowerCase() : 'whenever';
        if (await checkUrlPart(version, `ecpg-sql-${stringMatch.replaceAll(' ', '-')}`)) {
          urlPart = `ecpg-sql-${stringMatch.replaceAll(' ', '-')}`;
        } else {
          failedToFind.push(id);
          continue;
        }
      }
    } else {
      const header = header_depth > 1 ? header_path[1] : header_path[0];
      const match = header.match(/\{#([^}]+)\}/);
      urlPart = match ? match[1] : header.toLowerCase().replaceAll(' ', '-');
    }
  }
  await Promise.allSettled(promises);
  if (failedToFind.length > 0) {
    console.log(`Failed to find URL parts for IDs (${failedToFind.length}):`);
    console.log(JSON.stringify(failedToFind));
  }
})()
  .then(() => {
    console.log('Done');
    process.exit(0);
  })
  .catch((err) => {
    console.error('Error:', err);
    process.exit(1);
  })
  .finally(() => {
    pool.end();
  });
```