Normalize content directory names #3815
ac10455 to 8719cb7
I'm concerned we break links here or elsewhere. Can we dump the old sitemap somewhere and preserve it? Then we could use https://github.com/galaxyproject/galaxy-hub/blob/main/scripts/compare_sitemaps.py or something similar to test that the new site, with all the changes here, still serves all the old links and that the old sitemap is a subset of the new one. Does that make sense to you?
These should all be covered: the slug normalization system detects when a directory rename changes the URL and stores the original path as […]. Can definitely do a sitemap comparison too if you want the extra confidence, though.
I would feel better with that. It would also help to dump the sitemap of the old website somewhere before we take it offline, just to make sure we have some trace when we need to touch all of this for the next migration.
Pre-normalization sitemap snapshot
Built from 5396 URLs total.

Sitemap comparison (main → this branch)
After rebasing onto main and rebuilding, the post-normalization build also produces 5396 URLs. 780 URL paths changed between the two builds:
Every directory rename that was deployed has a corresponding redirect entry.
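
For anyone who wants to reproduce the check locally, here is a minimal sketch of the "old sitemap is covered by the new one" test. It is not the logic of compare_sitemaps.py; the function names, the example.org URLs, and the in-memory redirect map are all assumptions made for illustration. An old path counts as covered if the new build serves it directly or a redirect points it at a path the new build serves.

```javascript
// Extract URL paths from sitemap XML text (only <loc> elements are read).
function sitemapPaths(xml) {
  const paths = new Set();
  for (const match of xml.matchAll(/<loc>\s*([^<\s]+)\s*<\/loc>/g)) {
    paths.add(new URL(match[1]).pathname);
  }
  return paths;
}

// Return every old path that is neither served by the new build nor redirected
// to a path the new build serves.
function findBrokenPaths(oldPaths, newPaths, redirects) {
  const broken = [];
  for (const path of oldPaths) {
    const target = redirects.get(path);
    const served =
      newPaths.has(path) || (target !== undefined && newPaths.has(target));
    if (!served) broken.push(path);
  }
  return broken.sort();
}

// Hypothetical sample data; real input would be the two sitemap.xml files
// and the entries from redirects.yaml.
const oldXml = `<urlset>
  <url><loc>https://example.org/events/GCC_2026/</loc></url>
  <url><loc>https://example.org/news/</loc></url>
  <url><loc>https://example.org/old_page/</loc></url>
</urlset>`;
const newXml = `<urlset>
  <url><loc>https://example.org/events/gcc-2026/</loc></url>
  <url><loc>https://example.org/news/</loc></url>
</urlset>`;
const redirects = new Map([["/events/GCC_2026/", "/events/gcc-2026/"]]);

const broken = findBrokenPaths(sitemapPaths(oldXml), sitemapPaths(newXml), redirects);
console.log(broken); // only "/old_page/" has no page and no redirect
```

An empty result would mean the old sitemap is a subset of the new one once redirects are taken into account.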
…se/underscore dirs only

Removes the letter↔digit boundary rules from normalizeSlugSegment; they were splitting too many meaningful identifiers (gcc2026, orf3a, ga4gh, nsp2, etc.) into bad URL segments. camelCase and underscore→hyphen rules are kept. Adds mi-rna→mirna and ma-gs→mags slug overrides for the two bioinformatics terms that the camelCase rule still splits badly.

135 content directories renamed via git mv (parents before children). 135 redirect entries added to redirects.yaml covering all old paths. 129 collision cases skipped (both old and new name already exist separately). CloudFront function and test suite updated to match the simplified algorithm.
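
To make the simplified rules concrete, here is a rough sketch of what a normalizer with only camelCase and underscore handling could look like. Only the function name and the two overrides come from the commit message; the regular expressions and the way overrides are applied are guesses, and the real normalizeSlugSegment may well differ.

```javascript
// Overrides named in the commit message, for terms the camelCase rule splits badly.
const SLUG_OVERRIDES = new Map([
  ["mi-rna", "mirna"],
  ["ma-gs", "mags"],
]);

// Assumed rule set: camelCase boundaries and underscores become hyphens,
// then the result is lowercased. There is deliberately no letter/digit rule.
function normalizeSlugSegment(segment) {
  let slug = segment
    .replace(/([a-z])([A-Z])/g, "$1-$2") // fooBar -> foo-Bar
    .replace(/([A-Z]+)([A-Z][a-z])/g, "$1-$2") // XMLHttp -> XML-Http
    .replace(/_/g, "-")
    .toLowerCase();
  // Apply overrides only to whole hyphen-delimited runs of the slug.
  for (const [from, to] of SLUG_OVERRIDES) {
    slug = slug.replace(new RegExp(`(^|-)${from}(?=-|$)`, "g"), `$1${to}`);
  }
  return slug;
}

for (const name of ["gcc2026", "orf3a", "camelCase", "my_dir", "miRNA", "MAGs"]) {
  console.log(`${name} -> ${normalizeSlugSegment(name)}`);
}
```

With these rules, identifiers such as gcc2026 and orf3a pass through unchanged, while miRNA and MAGs would become mi-rna and ma-gs without the overrides.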
New script check-dir-names.mjs walks content/ and flags any directory whose name doesn't match its normalizeSlugSegment() form. Wired into npm run content:lint so CI catches newly added non-normalized dirs. content/.slug-bypass lists the 129 known collision paths that are acknowledged exceptions (both old and new-cased dirs exist on disk). Contributors can add their own bypass entries to suppress the check, which makes the exception explicit and reviewable.
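
The core of such a lint check can be sketched as below. This is not the contents of check-dir-names.mjs: the bypass file format (one path per line, `#` for comments), the function names, and the stand-in normalizer are assumptions, and the directory list is hard-coded here instead of being collected by walking content/.

```javascript
// Parse a bypass file. Assumed format: one path per line;
// blank lines and lines starting with # are ignored.
function parseBypass(text) {
  return new Set(
    text
      .split("\n")
      .map((line) => line.trim())
      .filter((line) => line !== "" && !line.startsWith("#"))
  );
}

// Flag directories whose final path segment differs from its normalized form,
// skipping any path listed in the bypass set.
function findNonNormalizedDirs(dirPaths, bypass, normalize) {
  const problems = [];
  for (const dir of dirPaths) {
    if (bypass.has(dir)) continue;
    const name = dir.split("/").pop();
    const expected = normalize(name);
    if (name !== expected) problems.push({ dir, expected });
  }
  return problems;
}

// Stand-in normalizer for this demo only; the real check would call
// normalizeSlugSegment().
const standInNormalize = (name) => name.replace(/_/g, "-").toLowerCase();

// Hypothetical paths for illustration.
const bypass = parseBypass("# known collisions\ncontent/events/Old_Name\n");
const dirs = [
  "content/events/Old_Name",
  "content/news/my_post",
  "content/news/fine-post",
];
const problems = findNonNormalizedDirs(dirs, bypass, standInNormalize);
for (const { dir, expected } of problems) {
  console.log(`${dir}: expected directory name "${expected}"`);
}
```

A CI wrapper would exit non-zero whenever the problem list is non-empty, so a newly added non-normalized directory fails the lint step unless it is listed in the bypass file.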
Generates a lightweight slug-lookup file (404-lookup.json) at build time that maps skeleton keys (alphanumeric-only, lowercased paths) to canonical URLs and titles. The 404 page fetches this and tries to match the current URL, handling differences in casing, hyphens, underscores, and camelCase. Also pre-populates the search link with keywords extracted from the URL.
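
A small sketch of the skeleton-key idea follows. The function names, the sample page, and the exact shape of the lookup are hypothetical and not taken from the actual 404-lookup.json; the sketch also assumes that slashes are stripped along with every other non-alphanumeric character.

```javascript
// Skeleton key: lowercase the path and keep only letters and digits.
function skeletonKey(path) {
  return path.toLowerCase().replace(/[^a-z0-9]/g, "");
}

// Build the lookup a build step could serialize to JSON
// (for example via Object.fromEntries).
function buildLookup(pages) {
  const lookup = new Map();
  for (const { url, title } of pages) {
    lookup.set(skeletonKey(url), { url, title });
  }
  return lookup;
}

// Resolve a missing URL to a canonical page, or null when nothing matches.
function suggestPage(lookup, requestedPath) {
  return lookup.get(skeletonKey(requestedPath)) ?? null;
}

// Split a path into lowercase search keywords, breaking on separators
// and on camelCase boundaries.
function searchKeywords(path) {
  return path
    .replace(/([a-z])([A-Z])/g, "$1 $2")
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter(Boolean);
}

// Hypothetical page entry for illustration.
const lookup = buildLookup([{ url: "/events/gcc-2026/", title: "GCC 2026" }]);
const suggestion = suggestPage(lookup, "/Events/GCC_2026");
console.log(suggestion);
const keywords = searchKeywords("/Events/trainingMaterial_intro/");
console.log(keywords);
```

Because the key ignores casing, hyphens, underscores, and slashes, two distinct pages can share a key; a real build step would need to decide how to handle that.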
b8a739f to 7235ed1
…ing a suggestion link
@bgruening Will do, I'll check out what just broke -- was definitely green earlier! :)
…-2026, egd-2025) and update test URLs to match
Summary
Split out from #3804 — this is the content directory normalization work, separated from the subsite insert fix.