mozilla-services · jrconlin · Aug 14, 2025 · Aug 20, 2025 · Aug 20, 2025 · Aug 20, 2025
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -6,6 +6,12 @@ GitHub.
 
 [matrix]: https://chat.mozilla.org
 
+## Getting Started
+
+Merino uses `uv` as its python configuration management system. To initialize a new developer
+environment, please refer to
+[Developer documentation for working on Merino](https://mozilla-services.github.io/merino-py/dev/index.html)
+
 ## Bug Reports
 
 You can file issues here on GitHub. Please try to include as much information as
@@ -38,7 +44,7 @@ When submitting a PR:
 - Add both your code and new tests if relevant.
 - [Sign][sign] your git commit.
 - Run tests, linting and formatting checks to make sure your code complies with established standards.
-(e.g. No warnings are returned for python: "`make lint`", "`make test`", "`make format`")
+  (e.g. No warnings are returned for python: "`make lint`", "`make test`", "`make format`")
 - Ensure your changes do not reduce code coverage of the test suite.
 - Please do not include merge commits in pull requests; include only commits
   with the new relevant code.
@@ -62,15 +68,15 @@ Every patch must be peer reviewed.
 We loosely follow the [Angular commit guidelines][angular_commit_guidelines]
 of `<type>: <subject>` where `type` must be one of:
 
-* **feat**: A new feature
-* **fix**: A bug fix
-* **docs**: Documentation only changes
-* **style**: Changes that do not affect the meaning of the code (white-space, formatting, missing
+- **feat**: A new feature
+- **fix**: A bug fix
+- **docs**: Documentation only changes
+- **style**: Changes that do not affect the meaning of the code (white-space, formatting, missing
   semi-colons, etc)
-* **refactor**: A code change that neither fixes a bug or adds a feature
-* **perf**: A code change that improves performance
-* **test**: Adding missing tests, test maintenance, and adding test features
-* **chore**: Changes to the build process or auxiliary tools and libraries such as documentation
+- **refactor**: A code change that neither fixes a bug or adds a feature
+- **perf**: A code change that improves performance
+- **test**: Adding missing tests, test maintenance, and adding test features
+- **chore**: Changes to the build process or auxiliary tools and libraries such as documentation
   generation
 
 If associated with a Jira ticket, synchronization with Jira and GitHub is possible by appending the suffix of the Jiraticket to the branch name (`MOZ-1234` in the example below). Name the branch using this nomenclature with the optional `<type>` followed by a forward slash, followed by the Jira ticket and then a
@@ -85,9 +91,9 @@ commit messages to keep the task up to date. See Jira Docs for referencing issue
 
 The subject contains succinct description of the change:
 
-* use the imperative, present tense: "change" not "changed" nor "changes"
-* don't capitalize first letter
-* no dot (.) at the end
+- use the imperative, present tense: "change" not "changed" nor "changes"
+- don't capitalize first letter
+- no dot (.) at the end
 
 ### Body
 
@@ -129,57 +135,57 @@ following qualities (also known as the F.I.R.S.T. principles):
 
 **Fast**
 
-* Test suites should be optimized to execute quickly, which encourages use by contributors and is
+- Test suites should be optimized to execute quickly, which encourages use by contributors and is
   essential for rapid pipelines.
-* If tests are taking too long, consider breaking them up into multiple smaller tests and executing
+- If tests are taking too long, consider breaking them up into multiple smaller tests and executing
   them with a parallel test runner.
-* For reference, unit tests should take milliseconds to run.
+- For reference, unit tests should take milliseconds to run.
 
 **Isolated**
 
-* Tests should be independently executable or standalone, not relying on an execution order with
+- Tests should be independently executable or standalone, not relying on an execution order with
   other tests.
-* Tests should clean up after themselves
-    * Tests may perform actions that have persistent effects, like setting a program state. If not
-      cleaned up properly, this can impact the execution of subsequent tests causing intermittent
-      failures.
-    * Helper methods or fixtures are a preferable architectural design when compared to setup and
-      teardown methods. As test suites scale, setup and teardown methods become a common source of
-      state issues and bloat. Tests will grow to have increasingly specific requirements that don't
-      have common relevancy or compatibility.
-* Tests should not rely on resources from sites whose content Mozilla doesn't control or where
+- Tests should clean up after themselves
+  - Tests may perform actions that have persistent effects, like setting a program state. If not
+    cleaned up properly, this can impact the execution of subsequent tests causing intermittent
+    failures.
+  - Helper methods or fixtures are a preferable architectural design when compared to setup and
+    teardown methods. As test suites scale, setup and teardown methods become a common source of
+    state issues and bloat. Tests will grow to have increasingly specific requirements that don't
+    have common relevancy or compatibility.
+- Tests should not rely on resources from sites whose content Mozilla doesn't control or where
   Mozilla has no SLA. This is a security concern as well as a source of intermittent failures.
-* Tests should avoid file system or database dependencies, their changes or outages can be a source
+- Tests should avoid file system or database dependencies, their changes or outages can be a source
   of intermittent failures.
 
 **Repeatable**
 
-* Tests should have consistent results when no code changes are made between test runs.
-* Tests should not contain time bombs. Meaning they should not be susceptible to failure given
+- Tests should have consistent results when no code changes are made between test runs.
+- Tests should not contain time bombs. Meaning they should not be susceptible to failure given
   execution during a given datetime or due to time comparisons.
-* Tests should avoid using _magical_ time delays when waiting for operations to finish executing.
-* Tests should not assume an execution order for asynchronous methods, this may lead to intermittent
+- Tests should avoid using _magical_ time delays when waiting for operations to finish executing.
+- Tests should not assume an execution order for asynchronous methods, this may lead to intermittent
   failures.
-* Tests should not depend on objects through weak references, which may be garbage collected during
+- Tests should not depend on objects through weak references, which may be garbage collected during
   test execution causing intermittent failures.
-* Tests should avoid using logical operations, such as `if`, `while`, `for`, and `switch`.
-    * Decision points are a nexus for introducing bugs. Bugs found in test code erodes trust and
-      diminishes value.
-    * Consider splitting up a test or using a data driven test framework before using logical
-      operations.
-* Test CI/CD jobs should avoid pulling the latest language or tooling dependencies. New dependency
+- Tests should avoid using logical operations, such as `if`, `while`, `for`, and `switch`.
+  - Decision points are a nexus for introducing bugs. Bugs found in test code erodes trust and
+    diminishes value.
+  - Consider splitting up a test or using a data driven test framework before using logical
+    operations.
+- Test CI/CD jobs should avoid pulling the latest language or tooling dependencies. New dependency
   versions can cause unintuitive failures.
 
 **Self-Deterministic**
 
-* No human intervention should be required to conclude if a test has passed or failed.
+- No human intervention should be required to conclude if a test has passed or failed.
 
 **Timely**
 
-* Production code should be crafted to be testable, so that writing and maintaining tests doesn't
+- Production code should be crafted to be testable, so that writing and maintaining tests doesn't
   take an unreasonable amount of time.
-* Test code should be written with the same quality standard as production code and subject to the
+- Test code should be written with the same quality standard as production code and subject to the
   same linters and formatters.
-* Consider using a test-first approach when developing production code.
+- Consider using a test-first approach when developing production code.
 
 [test_strategy]: /docs/dev/testing.md
diff --git a/docs/noob_guide.md b/docs/noob_guide.md
@@ -0,0 +1,133 @@
+# A Gentle Guide for the New Person
+
+This guide presumes that you know what [Merino](intro.md), are familiar with programming in Python 3.12+, and are looking to incorporate a new service.
+
+## Types of Merino Services
+
+Merino has two ways to provide suggestions, _off-line_ (which uses user agent locally stored data provided by Remote Settings) and _on-line_ (which provides more timely data by providing live responses to queries).
+
+_off-line_ data sets are generally smaller, since we have limited storage capacity available. These may use the [`csv_rs_uploader`](../merino/jobs/csv_rs_uploader) command. A good example of this is the []`wikipedia_offline_uploader`](../merino/jobs/wikipedia_offline_uploader) job.
+
+_on-line_ data do not necessarily have the same size restrictions, but are instead constrained by time. These services should return a response in less than 200ms.
+
+<a name="jobs"/>
+## Merino Jobs
+
+"Jobs" are various tasks that can be executed by Merino, and are located in the `./merino/jobs` directory. These jobs are invoked by calling `uv run merino-jobs {job_name}`. Running without a `{job_name}` returns a list of available jobs that can be run. For example:
+
+```bash
+> uv run merino-jobs
+
+ Usage: merino-jobs [OPTIONS] COMMAND [ARGS]...
+
+ CLI Entrypoint
+
+╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────────╮
+│ --help          Show this message and exit.                                                            │
+╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
+╭─ Commands ────────────────────────────────────────────────────────────────────────────────────────────────────────╮
+│ wikipedia-indexer           Commands for indexing Wikipedia exports into Elasticsearch                 │
+│ navigational-suggestions    Command for preparing top domain metadata for navigational suggestions     │
+│ amo-rs-uploader             Command for uploading AMO add-on suggestions to remote settings            │
+│ csv-rs-uploader             Command for uploading suggestions from a CSV file to remote settings       │
+│ relevancy-csv-rs-uploader   Command for uploading domain data from a CSV file to remote settings       │
+│ geonames-uploader           Uploads GeoNames data to remote settings                                   │
+│ wiki-offline-uploader       Command for uploading wiki suggestions                                     │
+│ polygon-ingestion           Commands to download ticker logos, upload to GCS, and generate manifest    │
+╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
+```
+
+Please note that file paths presume you are in the Project Root directory.
+
+### Ingestion
+
+A significant portion of work involves fetching and normalizing data.
+
+#### Code
+
+##### Jobs
+
+The ingestion applications are stored under `./merino/jobs/` each provider has it's own application, since each provider is slightly different. For consistency, we use [Typer](https://typer.tiangolo.com/tutorial/) to describe the command,
+
+_**TODO**_ Having weird problems defining(?)/activating(?) Options. Not sure how they're supposed to be passed along.
+
+###### Suggest
+
+Suggest operates by exposing a REST like interface. Code is structured so that:
+
+A **Provider** instantiates it's service (see _initialize()_) and handles the incoming HTTP request (see _query()_).
+Providers instantiate a **Backend**, which resolves individual datum (See _query(str)_) requests and returns a list of `merino.providers.suggest.base.BaseSuggestion`. The Backend is also responsible for managing and updating the **Manifest** data block (see [Manifest](#manifest)) via the `fetch_manifest_data()` and `build_and_upload_manifest_file()` methods.
+
+##### Configuration
+
+Configurations for the ingestion processes are stored under `./merino/configs` and are sets of TOML files. These include:
+
+- `ci.toml` - Continuous Integration configurations (Use only for CI tasks)
+- `default.toml` - Common, core settings. These are over-ridden by the platform specific configurations.
+- `development.toml`, etc. - The platform specific configurations to use. These will eventually be replaced by a single, composed `platform.toml`(name TBD).
+
+Validators for the configuration options are stored in the `./merino/configs/__init__.py` file
+
+#### Curated Recommendations
+
+Provides the set of Curated items on the New Tab page. (Probably don't want to go there, ask for help.)
+
+#### Governance
+
+Provides a set of "Circuit breakers" to interrupt long running or over burdensome processes
+
+#### Jobs
+
+This is the set of Merino Jobs that can be either run via cron, or as singletons. See [_Merino Jobs_](#jobs) above.
+
+#### Middleware
+
+The set of functions called on every request/response by the Merino system.
+
+#### Providers
+
+This is where many of the Merino provider APIs are defined. Things often blend between Web and Jobs, so it can be confusing to sort them out.
+
+<a name="manifest" />
+
+##### Manifest
+
+A `Manifest` in this context is the site metadata associated with a given provider. This metadata can include things like the site icon, description, weight, and other data elements (_**TODO**_: Need to understand this data better).
+
+Metadata is generally fetched from the site by a `job`, which may call a `Provider._fetch_manifest()` method to create and upload the data to a GCS bucket. This can be wrapped by the `merino.providers.manifest.backends.protocol.ManifestBackend.fetch()` If needed later by Merino web services, that bucket will be read and the Manifest data used to construct the `Suggestion`.
+
+`Manifest`s contain a list of `Domain`s and a list of partner dictionaries.
+
+`Domain`s are:
+
+- **rank**: unique numeric ranking for this item.
+- **domain**: the host domain without extension (e.g. for `example.com` the domain would be `example`)
+- **categories**: a list of business categories for this domain (**TODO**: where are these defined?)
+- **url**: the main site URL
+- **title**: site title or brief description
+- **icon**: URL to the icon stored in CDN
+- **serp_categories**: list of numeric category codes (defined by `merino.providers.suggest.base.Category`)
+- **similars**: [Optional] Similar words or common misspellings.
+
+Partners are a set of dictionaries that contain values about **TODO**: ???. The dictionaries may specify values such as:
+
+- **"domain"**: the host name of the partner (e.g. `example.com`)
+- **"url"**: preferred URL to the partner
+- **"original_icon_url"**: non-cached, original source URL for the icon.
+- **"gcs_icon_url"**: URL to the icon stored in CDN
+
+It's important to note that the `Manifest` is a [Pydantic BaseModel](https://docs.pydantic.dev/latest/api/base_model/), and as such, the elements are not directly accessible.
+
+##### Suggest
+
+Each `Provider` has specific code relating to how the data should be fetched and displayed. Categories of providers can be gathered under a group to take advantage of python subclassing. Once created, the provider can be included in the Merino suggestion groups by updating `merino.providers.suggest.manager._create_provider()`.
+
+See `merino.providers.skeleton` for a general use template that modules could use.
+
+#### Scripts
+
+Simple utility scripts that may be useful.
+
+#### Tests
+
+The bank of tests (Unit and Integration) to validate code changes. All code changes should include appropriate tests.
diff --git a/merino/jobs/cli.py b/merino/jobs/cli.py
@@ -12,6 +12,8 @@
 from merino.jobs.wikipedia_offline_uploader import wiki_offline_uploader_cmd
 from merino.jobs.polygon import cli as polygon_ingestion_cmd
 
+# Include your new jobs module here.
+
 # NOTE: `pretty_exceptions_show_locals` argument is set to False to avoid api_key and secrets exposure.
 cli = typer.Typer(no_args_is_help=True, add_completion=False, pretty_exceptions_show_locals=False)
 
@@ -36,6 +38,9 @@
 # Add the polygon ingest subcommand
 cli.add_typer(polygon_ingestion_cmd, no_args_is_help=True)
 
+# Describe this command and link to it.
+# cli.add_typer(skeleton_cmd, no_args_is_help=True)
+
 
 @cli.callback()
 def setup():