Normalize content directory names #3815

Merged
dannon merged 11 commits into galaxyproject:main from dannon:feature/normalize-content-dirs
Mar 12, 2026

@dannon
Member

@dannon dannon commented Mar 7, 2026

Summary

Split out from #3804 — this is the content directory normalization work, separated from the subsite insert fix.

  • Drop letter↔digit split rules from slug normalization and rename camelCase/underscore content directories to kebab-case
  • Add CI lint check to catch non-normalized content directory names going forward

@dannon dannon force-pushed the feature/normalize-content-dirs branch from ac10455 to 8719cb7 on March 9, 2026 19:26
@bgruening
Member

I'm concerned we break links here or elsewhere. Can we dump the old sitemap somewhere and preserve it? Then use https://github.com/galaxyproject/galaxy-hub/blob/main/scripts/compare_sitemaps.py or something similar to test that the new site, with all these changes, still serves all the old links and that the old sitemap is a subset of the new one? Does that make sense to you?

@dannon
Member Author

dannon commented Mar 10, 2026

These should all be covered — the slug normalization system detects when a directory rename changes the URL and stores the original path as naturalSlug in frontmatter, which triggers redirect page generation at the old URLs. Those redirects also get baked as S3 object metadata (#3802) for native 301s, so old links keep working.
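
To make the mechanism above concrete, here is a minimal sketch of how pages carrying a `naturalSlug` could be turned into redirect entries and static redirect pages. The function names, the page object shape, and the use of a meta-refresh page are assumptions for illustration; the actual hub code may do this differently.

```javascript
// Hypothetical page shape: { slug, naturalSlug? }.
// naturalSlug is the pre-rename path recorded in frontmatter, per the comment above.
function collectRedirects(pages) {
  const redirects = [];
  for (const page of pages) {
    // Only pages whose URL actually changed need a redirect.
    if (page.naturalSlug && page.naturalSlug !== page.slug) {
      redirects.push({ from: page.naturalSlug, to: page.slug });
    }
  }
  return redirects;
}

// Assumed shape of a static redirect page; slugs are taken to be URL-safe,
// so no escaping is done here.
function renderRedirectPage(to) {
  return [
    "<!doctype html>",
    '<meta charset="utf-8">',
    `<link rel="canonical" href="${to}">`,
    `<meta http-equiv="refresh" content="0; url=${to}">`,
    `<a href="${to}">This page has moved.</a>`,
  ].join("\n");
}
```

A page whose `naturalSlug` equals its `slug`, or that has none, produces no entry, so only directories whose rename changed the URL generate redirects.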

Can definitely do a sitemap comparison too if you want the extra confidence though.

@bgruening
Member

> Can definitely do a sitemap comparison too if you want the extra confidence though.

I would feel better with that. It would also help to dump the sitemap of the old website somewhere before we take it offline, just so we have some trace when we need to touch all of this for the next migration.

@dannon
Member Author

dannon commented Mar 11, 2026

Pre-normalization sitemap snapshot

Built from main at 3954094 (pre-merge): https://gist.github.com/dannon/0bfc270430b48ba313e7083bf0a9573c

5396 URLs total.

Sitemap comparison (main → this branch)

After rebasing onto main and rebuilding, the post-normalization build also produces 5396 URLs. 780 URL paths changed between the two builds:

  • ~130 from actual directory renames (camelCase/underscore → kebab-case) — these are covered by redirects in redirects.yaml and generated redirect HTML pages in public/
  • ~650 from dropping the letter↔digit split rule in the slug normalizer (e.g., gcc-2012 → gcc2012) — these never existed on a deployed site, so no redirects needed

Every directory rename that was deployed has a corresponding redirect entry.
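
The check described above (every old URL is either still served or covered by a redirect) can be sketched as follows. This is not the repository's compare_sitemaps.py; the function names, the `{ from, to }` redirect shape, and the placeholder base URL are assumptions for illustration.

```javascript
// Reduce a URL or path to a comparable pathname without a trailing slash.
// The base URL is only a placeholder so that bare paths can be parsed.
function pathOf(url) {
  const pathname = new URL(url, "https://example.org").pathname;
  return pathname.replace(/\/+$/, "") || "/";
}

// Returns old paths that are neither served by the new build nor
// redirected to a path that the new build serves.
function findUncoveredUrls(oldUrls, newUrls, redirects) {
  const served = new Set(newUrls.map(pathOf));
  const redirected = new Set(
    redirects
      .filter((r) => served.has(pathOf(r.to)))
      .map((r) => pathOf(r.from))
  );
  return oldUrls
    .map(pathOf)
    .filter((p) => !served.has(p) && !redirected.has(p));
}
```

An empty result means the old sitemap is a subset of the new one once redirects are taken into account. Paths are compared case-sensitively, since the renames in this PR are largely case changes.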

dannon added 4 commits March 11, 2026 16:17
…se/underscore dirs only

Removes the letter↔digit boundary rules from normalizeSlugSegment — they were
splitting too many meaningful identifiers (gcc2026, orf3a, ga4gh, nsp2, etc.)
into bad URL segments. camelCase and underscore→hyphen rules are kept.

Adds mi-rna→mirna and ma-gs→mags slug overrides for the two bioinformatics
terms that the camelCase rule still splits badly.
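
A minimal sketch of the simplified algorithm described in this commit message, assuming two camelCase regexes and an overrides table. Only the name normalizeSlugSegment and the two override pairs come from the message; the regexes and the `SLUG_OVERRIDES` name are illustrative, and the real function may differ.

```javascript
// Hypothetical overrides table; the two entries are the ones named above.
const SLUG_OVERRIDES = new Map([
  ["mi-rna", "mirna"],
  ["ma-gs", "mags"],
]);

// Sketch only: no letter-digit boundary rule, so "gcc2026" stays intact.
function normalizeSlugSegment(segment) {
  let slug = segment
    // camelCase boundary: "someDir" -> "some-Dir"
    .replace(/([a-z])([A-Z])/g, "$1-$2")
    // acronym followed by a capitalized word: "XMLParser" -> "XML-Parser"
    .replace(/([A-Z]+)([A-Z][a-z])/g, "$1-$2")
    // underscores become hyphens
    .replace(/_+/g, "-")
    .toLowerCase();

  // Undo known bad splits, matching whole hyphen-delimited runs only.
  for (const [split, joined] of SLUG_OVERRIDES) {
    const pattern = new RegExp(`(^|-)${split}(?=-|$)`, "g");
    slug = slug.replace(pattern, `$1${joined}`);
  }
  return slug;
}
```

Under these assumed rules, `miRNA` first becomes `mi-rna` and `MAGs` becomes `ma-gs`, which is why the overrides are needed, while identifiers such as `orf3a` and `ga4gh` pass through unchanged.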

135 content directories renamed via git mv (parents before children).
135 redirect entries added to redirects.yaml covering all old paths.
129 collision cases skipped (both old and new name already exist separately).
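
The parents-before-children ordering and the collision skip can be sketched like this. The `toKebab` helper is a simplified stand-in for the project's normalizer, and `planRenames` and its return shape are hypothetical names, not the script actually used for this PR.

```javascript
// Simplified stand-in for the project's slug normalizer (assumption).
function toKebab(segment) {
  return segment
    .replace(/([a-z])([A-Z])/g, "$1-$2")
    .replace(/_+/g, "-")
    .toLowerCase();
}

// dirs: directory paths relative to content/, in any order.
function planRenames(dirs, normalize = toKebab) {
  const depth = (p) => p.split("/").length;
  const join = (dir, name) => (dir ? `${dir}/${name}` : name);
  // Shallow paths first, so a parent is renamed before its children.
  const ordered = [...dirs].sort((a, b) => depth(a) - depth(b));
  const taken = new Set(dirs); // occupied names, in original-path terms
  const current = new Map(); // original path -> path after earlier renames
  const renames = [];
  const collisions = [];

  for (const original of ordered) {
    const cut = original.lastIndexOf("/");
    const parent = cut === -1 ? "" : original.slice(0, cut);
    const leaf = original.slice(cut + 1);
    const currentParent = current.get(parent) ?? parent;
    const from = join(currentParent, leaf);
    const normalizedLeaf = normalize(leaf);

    if (normalizedLeaf === leaf) {
      current.set(original, from);
      continue;
    }
    if (taken.has(join(parent, normalizedLeaf))) {
      // Both the old and the normalized name exist: skip, as in the commit.
      collisions.push(original);
      current.set(original, from);
      continue;
    }
    taken.add(join(parent, normalizedLeaf));
    const to = join(currentParent, normalizedLeaf);
    renames.push({ original, from, to });
    current.set(original, to);
  }
  return { renames, collisions };
}
```

Each entry's `from` is the path as it exists after earlier renames, which is what a `git mv` would need, while `original` is the old path a redirect entry would cover.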

CloudFront function and test suite updated to match the simplified algorithm.

New script check-dir-names.mjs walks content/ and flags any directory
whose name doesn't match its normalizeSlugSegment() form. Wired into
npm run content:lint so CI catches newly added non-normalized dirs.

content/.slug-bypass lists the 129 known collision paths that are
acknowledged exceptions (both old and new-cased dirs exist on disk).
Contributors can add their own bypass entries to suppress the check,
which makes the exception explicit and reviewable.

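
A rough sketch of the lint check and bypass list described above. It works on an already collected list of directory paths rather than walking content/ itself, and the bypass file format (one path per line, `#` comments) is an assumption, as are the helper names.

```javascript
// Simplified stand-in for normalizeSlugSegment() (assumption).
function toKebab(segment) {
  return segment
    .replace(/([a-z])([A-Z])/g, "$1-$2")
    .replace(/_+/g, "-")
    .toLowerCase();
}

// Assumed bypass format: one path per line; blank lines and "#" comments ignored.
function parseBypass(text) {
  return new Set(
    text
      .split("\n")
      .map((line) => line.trim())
      .filter((line) => line && !line.startsWith("#"))
  );
}

// dirPaths: directory paths relative to content/, e.g. "events/GCC_2012".
// Returns every directory whose name differs from its normalized form,
// except those explicitly listed in the bypass set.
function findNonNormalizedDirs(dirPaths, bypass = new Set(), normalize = toKebab) {
  const offenders = [];
  for (const dirPath of dirPaths) {
    if (bypass.has(dirPath)) continue;
    const name = dirPath.split("/").pop();
    const expected = normalize(name);
    if (name !== expected) offenders.push({ path: dirPath, name, expected });
  }
  return offenders;
}
```

A CI wrapper would presumably exit non-zero when the returned list is non-empty and print each `expected` name as the suggested fix.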
Generates a lightweight slug-lookup file (404-lookup.json) at build time
that maps skeleton keys (alphanumeric-only, lowercased paths) to canonical
URLs and titles. The 404 page fetches this and tries to match the current
URL, handling differences in casing, hyphens, underscores, and camelCase.
Also pre-populates the search link with keywords extracted from the URL.
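
The skeleton-key matching described above might look roughly like this. The function names and the `{ url, title }` entry shape are assumptions, and whether the real keys also drop path separators is a guess based on "alphanumeric-only".

```javascript
// Skeleton key: lowercase the path and drop everything that is not a letter or digit.
function skeletonKey(path) {
  return path.toLowerCase().replace(/[^a-z0-9]/g, "");
}

// pages: [{ url, title }] for every canonical page, collected at build time.
// A Map is used here; it could be serialized with Object.fromEntries for a JSON file.
function buildLookup(pages) {
  const lookup = new Map();
  for (const { url, title } of pages) {
    lookup.set(skeletonKey(url), { url, title });
  }
  return lookup;
}

// On the 404 page: look for a canonical page with the same skeleton key.
function findCanonical(lookup, requestedPath) {
  return lookup.get(skeletonKey(requestedPath)) ?? null;
}

// Keywords for the search link: split on camelCase boundaries and non-alphanumerics.
function extractKeywords(path) {
  return path
    .replace(/([a-z])([A-Z])/g, "$1 $2")
    .split(/[^A-Za-z0-9]+/)
    .filter(Boolean)
    .map((word) => word.toLowerCase());
}
```

Because the key ignores case, hyphens, underscores, and slashes, requests such as `/Events/GCC_2012` and `/events/gcc-2012/` both resolve to a canonical `/events/gcc2012/` entry. Two different pages can share a skeleton key under this scheme, in which case the later entry wins in this sketch.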
@dannon dannon force-pushed the feature/normalize-content-dirs branch from b8a739f to 7235ed1 on March 11, 2026 20:17
Member

@bgruening bgruening left a comment

@dannon please merge when it's green. I will hold off on merging other stuff until then.

@dannon
Member Author

dannon commented Mar 12, 2026

@bgruening Will do, I'll check out what just broke -- was definitely green earlier! :)

@dannon dannon enabled auto-merge March 12, 2026 15:04
@dannon dannon merged commit e273f58 into galaxyproject:main Mar 12, 2026
6 checks passed