Merge branch 'describe-architecture-with-hugo'

dscho · dscho · commit 2be984ae1f5d · 2024-09-20T16:25:26.000+02:00
Now that the site is converted to be built with Hugo and Pagefind, let's
reflect that status quo in the document describing the site's
architecture.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
@@ -1,161 +1,97 @@
 # git-scm.com architecture
 
 This document describes the general setup and architecture that runs the
-git-scm.com site. The idea is to document all the moving parts that
-_aren't_ checked in to this repository. That may help new people joining
-the project to help out, as well provide some continuity in case the
-maintainer is hit by a bus.
+git-scm.com site.
 
 ## Content
 
-Though the site is a rails app, it can _mostly_ be thought of as serving
-static content. It's just that we suck in that static content and
-pre-process it using nightly scheduled jobs. We never write anything to
-the database on behalf of user requests.
+This site is served via GitHub Pages and is a [Hugo](https://gohugo.io/) site
+with the search implemented using [Pagefind](https://pagefind.app/).
 
 The content is a mix of:
 
-  - actual static content in this repository
+  - original content from this repository
 
   - community book content brought in from https://github.com/progit;
-    see the `lib/tasks/book2.rake` file.
+    see the `script/update-book2.rb` and `script/book.rb` files.
 
-  - manpages from releases of the git project, imported and formatted
-    via asciidoctor; see the `lib/tasks/index.rake` task.
+    The content is pre-rendered and tracked in the `external/book/` directory
+    tree.
 
+  - manual pages from releases of the git project, imported and formatted via
+    AsciiDoctor, and translated versions of the manual pages from
+    https://github.com/jnavila/git-html-l10n/ (which itself contains
+    pre-rendered pages from https://github.com/jnavila/git-manpages-l10n/); see
+    the `script/update-docs.rb` file.
 
-## Heroku
+    The pre-rendered pages are tracked in the `external/docs/` directory tree.
 
-The app itself is served by Heroku. The app name is `git-scm` (so you
-can visit it directly as https://git-scm.herokuapp.com). The site is
-owned by the git-scm.com team. If you want to be involved in managing
-uptime/deploys/etc, you'll need a Heroku account and request to be added
-to that team.
+To deploy to GitHub Pages, it is necessary to turn off the default setting to
+"publish from a branch" and instead change the setting to "publish with a
+custom GitHub Actions workflow":
+https://docs.github.com/en/pages/getting-started-with-github-pages/configuring-a-publishing-source-for-your-github-pages-site#publishing-with-a-custom-github-actions-workflow
+With this change, the site can be tested in the fork by pushing to the
+`gh-pages` branch (which will trigger the `deploy.yml` workflow) and then
+navigating to https://<user>.github.io/git-scm.com/.
 
-We use a few Heroku add-ons:
+## Non-static parts
 
-  - Bonsai elasticsearch (see below)
+While the site consists mostly of static content, two parts are dynamic:
+the client-side search and the scheduled content updates.
 
-  - Heroku Postgres as the database
+The search is implemented client-side, via [Pagefind](https://pagefind.app/).
 
-  - Heroku Redis for rails caching
+A few scheduled GitHub workflows keep the content up to date:
 
-  - Heroku scheduler for cron jobs
+  - `update-git-version-and-manual-pages` and `update-download-data` (pick
+    up newly released git versions)
 
-The nightly scheduled jobs are:
+  - `update-translated-manual-pages` (fetch and format translated manual
+    pages from the jnavila/git-html-l10n repository)
 
-  - `rake downloads` (pick up newly released git versions)
-
-  - `rake preindex` (pull in and format manpages for released git
-    versions)
-
-  - `rake remote_genbook2` (pull in and format progit2 book content,
+  - `update-book` (fetch and format progit2 book content,
     including translations)
 
-It should be safe to run any of those jobs more frequently. E.g., if you
-know there's a new Git release out, then:
-
-    heroku run rake preindex
-    heroku run rake downloads
-
-will get it on the site without waiting for the nightly run.
-
-Merges to the `main` branch on GitHub auto-deploy to Heroku, so unless
-you're doing something tricky you generally shouldn't need to manually
-deploy.
-
-Note that some of the formatting of manpages and book content happens
-when they are imported by the rake tasks. So after fixing some
-formatting and deploying, the rake jobs may need to be re-run with a
-special flag to re-import (see the individual tasks for details).
-
-
-## Cloudflare
-
-We get enough requests that it's easy to overwhelm the single Heroku
-dyno. So we have Cloudflare sitting in front of it, aggressively caching
-everything. That also should make the site faster to serve to regions
-far away from Heroku's servers.
-
-The Cloudflare setup is mostly pretty simple:
+These workflows also have a `workflow_dispatch` trigger, i.e. they can be run
+manually (e.g. to update the download links just after Git for Windows
+publishes a new release).
 
- - they serve DNS for the whole domain (that's where they insert the CDN
-   magic)
-
- - Cloudflare provides `https://` support to the user. Obviously the
-   site is totally open and doesn't have any sensitive data, so this is
-   really more about integrity. The certificate is generated by
-   Cloudflare (and requires SNI on the browser side).
-
- - the Cloudflare connection to Heroku is passed over TLS; they provide an
-   "internal" certificate that we ask Heroku to use, so the connection
-   is secured between the two (again, mostly for integrity)
-
- - the most exotic config is that we use "page rules" to mark the whole
-   site to be cached aggressively, regardless of any caching headers
-   sent from Heroku. This is a bit of a hack, but there's very little on
-   the site that can't be cached (which is perhaps a sign that the rails
-   setup needs to be tweaked to send more reasonable caching headers,
-   but this has been simple and effective so far).
-
-   There are a few special page rules to lift this caching for cases
-   where we do server-side logic (e.g.,
-   https://github.com/git/git-scm.com/issues/1129#issuecomment-363067019"),
-   but the long-term goal is to push that logic onto the client side as
-   much as possible.
-
-Both domains (c.f., the section on [DNS](#DNS) below) are owned by a
-Cloudflare "Team", and membership of that team is required to
-administrate the domains. Similar to the Heroku setup, you can ask to
-join this team if you wish to help out. The information about the team
-setup is in escrow with the Git PLC at Software Freedom Conservancy.
-Cloudflare provides the project with enough credits that it doesn't cost
-anything (though we're not using very many features, so it's possible
-that a free account would be sufficient, too).
-
-## Bonsai Elasticsearch
-
-The search functionality on the site is served by an elasticsearch
-cluster. The index can be populated by running `rake search_index`
-(manpages) and `rake search_index_book` (book) on Heroku (we only index
-the manpages and book). This perhaps should be run nightly, or at least
-after pulling in new content, but it currently isn't done automatically.
-
-The elasticsearch cluster is provided by Bonsai via their Heroku plugin.
-Our needs are larger than their free tier provides, but we receive
-credits from them that provide the service for free.
+Merges to the `gh-pages` branch on GitHub auto-deploy to GitHub Pages via the
+`deploy` GitHub workflow.
 
+Note that some of the formatting of manual pages and book content happens
+when they are imported by the GitHub workflows. Therefore, whenever there are
+changes to the scripts/workflows/automation that affect formatting, these
+workflows may need to be triggered manually with the `force-rebuild` flag
+toggled (see the individual workflows for details).
 
 ## DNS
 
-The actual DNS service is provided by Cloudflare (see above). The domain
-itself is registered with Gandi, and is owned by the project via
-Software Freedom Conservancy. Funds for the registration are provided
-from the Git project's Conservancy funds, and both the Git PLC and
-Conservancy have credentials to modify the setup.
+The actual DNS service is provided by Cloudflare. The domain itself is
+registered with Gandi, and is owned by the project via Software Freedom
+Conservancy. Funds for the registration are provided from the Git project's
+Conservancy funds, and both the Git PLC and Conservancy have credentials to
+modify the setup.
 
 Note that we own both git-scm.com and git-scm.org; the latter redirects
 to the former.
 
-
 ## Manual Intervention
 
 The site mostly just runs without intervention:
 
-  - code merged to `main` is auto-deployed
+  - code merged to `gh-pages` is auto-deployed
 
-  - new git versions are detected daily and manpages and download links
+  - new git versions are detected daily and manual pages and download links
     updated
 
   - book updates (including translations) are picked up daily
 
 There are a few tasks that still need to be handled by a human:
 
-  - new images added to the book have to be copied manually from
-    progit/progit2
-
   - new languages for book translations need to be added to
-    `lib/tasks/book2.rake`
+    `script/book.rb`
 
-  - forced re-imports of content (e.g., a formatting fix to imported
-    manpages) must be triggered manually
+  - forced re-imports of content (e.g., when fixing formatting in the
+    imported manual pages) must be triggered manually with `force-rebuild`
+    toggled
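
For readers unfamiliar with the trigger setup described in the "Non-static parts" section of the new text, the following is a minimal sketch of a workflow that runs both on a schedule and on manual dispatch with a force-rebuild toggle. It is not copied from the repository: the cron time, the job and step names, the `FORCE_REBUILD` variable, and the exact shape of the `force-rebuild` input are assumptions made for illustration, and the real update step is replaced by an `echo`.

```yaml
# Hypothetical sketch; see the actual files under .github/workflows/ for the real definitions.
name: update-book

on:
  schedule:
    # Assumed time; the document only says that updates are picked up daily.
    - cron: '0 3 * * *'
  workflow_dispatch:
    inputs:
      force-rebuild:
        description: 'Re-import content even if nothing changed upstream'
        type: boolean
        default: false

jobs:
  update:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Update content (placeholder)
        env:
          # Empty on scheduled runs, "true" or "false" on manual runs.
          FORCE_REBUILD: ${{ inputs.force-rebuild }}
        run: |
          if [ "$FORCE_REBUILD" = "true" ]; then
            echo "would re-import all content"
          else
            echo "would import only new content"
          fi
```

With a definition of this shape, a scheduled run never forces a rebuild, while a manual run from the repository's Actions tab offers the toggle as a checkbox.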