Skip to content
Open
Show file tree
Hide file tree
Changes from 8 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
88 changes: 47 additions & 41 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,12 @@ GitHub.

[matrix]: https://chat.mozilla.org

## Getting Started

Merino uses `uv` as its python configuration management system. To initialize a new developer
environment, please refer to
[Developer documentation for working on Merino](https://mozilla-services.github.io/merino-py/dev/index.html)

## Bug Reports

You can file issues here on GitHub. Please try to include as much information as
Expand Down Expand Up @@ -38,7 +44,7 @@ When submitting a PR:
- Add both your code and new tests if relevant.
- [Sign][sign] your git commit.
- Run tests, linting and formatting checks to make sure your code complies with established standards.
(e.g. No warnings are returned for python: "`make lint`", "`make test`", "`make format`")
(e.g. No warnings are returned for python: "`make lint`", "`make test`", "`make format`")
- Ensure your changes do not reduce code coverage of the test suite.
- Please do not include merge commits in pull requests; include only commits
with the new relevant code.
Expand All @@ -62,15 +68,15 @@ Every patch must be peer reviewed.
We loosely follow the [Angular commit guidelines][angular_commit_guidelines]
of `<type>: <subject>` where `type` must be one of:

* **feat**: A new feature
* **fix**: A bug fix
* **docs**: Documentation only changes
* **style**: Changes that do not affect the meaning of the code (white-space, formatting, missing
- **feat**: A new feature
- **fix**: A bug fix
- **docs**: Documentation only changes
- **style**: Changes that do not affect the meaning of the code (white-space, formatting, missing
semi-colons, etc)
* **refactor**: A code change that neither fixes a bug or adds a feature
* **perf**: A code change that improves performance
* **test**: Adding missing tests, test maintenance, and adding test features
* **chore**: Changes to the build process or auxiliary tools and libraries such as documentation
- **refactor**: A code change that neither fixes a bug or adds a feature
- **perf**: A code change that improves performance
- **test**: Adding missing tests, test maintenance, and adding test features
- **chore**: Changes to the build process or auxiliary tools and libraries such as documentation
generation

If associated with a Jira ticket, synchronization with Jira and GitHub is possible by appending the suffix of the Jiraticket to the branch name (`MOZ-1234` in the example below). Name the branch using this nomenclature with the optional `<type>` followed by a forward slash, followed by the Jira ticket and then a
Expand All @@ -85,9 +91,9 @@ commit messages to keep the task up to date. See Jira Docs for referencing issue

The subject contains succinct description of the change:

* use the imperative, present tense: "change" not "changed" nor "changes"
* don't capitalize first letter
* no dot (.) at the end
- use the imperative, present tense: "change" not "changed" nor "changes"
- don't capitalize first letter
- no dot (.) at the end

### Body

Expand Down Expand Up @@ -129,57 +135,57 @@ following qualities (also known as the F.I.R.S.T. principles):

**Fast**

* Test suites should be optimized to execute quickly, which encourages use by contributors and is
- Test suites should be optimized to execute quickly, which encourages use by contributors and is
essential for rapid pipelines.
* If tests are taking too long, consider breaking them up into multiple smaller tests and executing
- If tests are taking too long, consider breaking them up into multiple smaller tests and executing
them with a parallel test runner.
* For reference, unit tests should take milliseconds to run.
- For reference, unit tests should take milliseconds to run.

**Isolated**

* Tests should be independently executable or standalone, not relying on an execution order with
- Tests should be independently executable or standalone, not relying on an execution order with
other tests.
* Tests should clean up after themselves
* Tests may perform actions that have persistent effects, like setting a program state. If not
cleaned up properly, this can impact the execution of subsequent tests causing intermittent
failures.
* Helper methods or fixtures are a preferable architectural design when compared to setup and
teardown methods. As test suites scale, setup and teardown methods become a common source of
state issues and bloat. Tests will grow to have increasingly specific requirements that don't
have common relevancy or compatibility.
* Tests should not rely on resources from sites whose content Mozilla doesn't control or where
- Tests should clean up after themselves
- Tests may perform actions that have persistent effects, like setting a program state. If not
cleaned up properly, this can impact the execution of subsequent tests causing intermittent
failures.
- Helper methods or fixtures are a preferable architectural design when compared to setup and
teardown methods. As test suites scale, setup and teardown methods become a common source of
state issues and bloat. Tests will grow to have increasingly specific requirements that don't
have common relevancy or compatibility.
- Tests should not rely on resources from sites whose content Mozilla doesn't control or where
Mozilla has no SLA. This is a security concern as well as a source of intermittent failures.
* Tests should avoid file system or database dependencies, their changes or outages can be a source
- Tests should avoid file system or database dependencies, their changes or outages can be a source
of intermittent failures.

**Repeatable**

* Tests should have consistent results when no code changes are made between test runs.
* Tests should not contain time bombs. Meaning they should not be susceptible to failure given
- Tests should have consistent results when no code changes are made between test runs.
- Tests should not contain time bombs. Meaning they should not be susceptible to failure given
execution during a given datetime or due to time comparisons.
* Tests should avoid using _magical_ time delays when waiting for operations to finish executing.
* Tests should not assume an execution order for asynchronous methods, this may lead to intermittent
- Tests should avoid using _magical_ time delays when waiting for operations to finish executing.
- Tests should not assume an execution order for asynchronous methods, this may lead to intermittent
failures.
* Tests should not depend on objects through weak references, which may be garbage collected during
- Tests should not depend on objects through weak references, which may be garbage collected during
test execution causing intermittent failures.
* Tests should avoid using logical operations, such as `if`, `while`, `for`, and `switch`.
* Decision points are a nexus for introducing bugs. Bugs found in test code erodes trust and
diminishes value.
* Consider splitting up a test or using a data driven test framework before using logical
operations.
* Test CI/CD jobs should avoid pulling the latest language or tooling dependencies. New dependency
- Tests should avoid using logical operations, such as `if`, `while`, `for`, and `switch`.
- Decision points are a nexus for introducing bugs. Bugs found in test code erodes trust and
diminishes value.
- Consider splitting up a test or using a data driven test framework before using logical
operations.
- Test CI/CD jobs should avoid pulling the latest language or tooling dependencies. New dependency
versions can cause unintuitive failures.

**Self-Deterministic**

* No human intervention should be required to conclude if a test has passed or failed.
- No human intervention should be required to conclude if a test has passed or failed.

**Timely**

* Production code should be crafted to be testable, so that writing and maintaining tests doesn't
- Production code should be crafted to be testable, so that writing and maintaining tests doesn't
take an unreasonable amount of time.
* Test code should be written with the same quality standard as production code and subject to the
- Test code should be written with the same quality standard as production code and subject to the
same linters and formatters.
* Consider using a test-first approach when developing production code.
- Consider using a test-first approach when developing production code.

[test_strategy]: /docs/dev/testing.md
133 changes: 133 additions & 0 deletions docs/noob_guide.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,133 @@
# A Gentle Guide for the New Person

This guide presumes that you know what [Merino](intro.md), are familiar with programming in Python 3.12+, and are looking to incorporate a new service.

## Types of Merino Services

Merino has two ways to provide suggestions, _off-line_ (which uses user agent locally stored data provided by Remote Settings) and _on-line_ (which provides more timely data by providing live responses to queries).

_off-line_ data sets are generally smaller, since we have limited storage capacity available. These may use the [`csv_rs_uploader`](../merino/jobs/csv_rs_uploader) command. A good example of this is the []`wikipedia_offline_uploader`](../merino/jobs/wikipedia_offline_uploader) job.

_on-line_ data do not necessarily have the same size restrictions, but are instead constrained by time. These services should return a response in less than 200ms.

<a name="jobs"/>
## Merino Jobs

"Jobs" are various tasks that can be executed by Merino, and are located in the `./merino/jobs` directory. These jobs are invoked by calling `uv run merino-jobs {job_name}`. Running without a `{job_name}` returns a list of available jobs that can be run. For example:

```bash
> uv run merino-jobs

Usage: merino-jobs [OPTIONS] COMMAND [ARGS]...

CLI Entrypoint

╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ wikipedia-indexer Commands for indexing Wikipedia exports into Elasticsearch │
│ navigational-suggestions Command for preparing top domain metadata for navigational suggestions │
│ amo-rs-uploader Command for uploading AMO add-on suggestions to remote settings │
│ csv-rs-uploader Command for uploading suggestions from a CSV file to remote settings │
│ relevancy-csv-rs-uploader Command for uploading domain data from a CSV file to remote settings │
│ geonames-uploader Uploads GeoNames data to remote settings │
│ wiki-offline-uploader Command for uploading wiki suggestions │
│ polygon-ingestion Commands to download ticker logos, upload to GCS, and generate manifest │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```

Please note that file paths presume you are in the Project Root directory.

### Ingestion

A significant portion of work involves fetching and normalizing data.

#### Code

##### Jobs

The ingestion applications are stored under `./merino/jobs/` each provider has it's own application, since each provider is slightly different. For consistency, we use [Typer](https://typer.tiangolo.com/tutorial/) to describe the command,

_**TODO**_ Having weird problems defining(?)/activating(?) Options. Not sure how they're supposed to be passed along.

###### Suggest

Suggest operates by exposing a REST like interface. Code is structured so that:

A **Provider** instantiates it's service (see _initialize()_) and handles the incoming HTTP request (see _query()_).
Providers instantiate a **Backend**, which resolves individual datum (See _query(str)_) requests and returns a list of `merino.providers.suggest.base.BaseSuggestion`. The Backend is also responsible for managing and updating the **Manifest** data block (see [Manifest](#manifest)) via the `fetch_manifest_data()` and `build_and_upload_manifest_file()` methods.

##### Configuration

Configurations for the ingestion processes are stored under `./merino/configs` and are sets of TOML files. These include:

- `ci.toml` - Continuous Integration configurations (Use only for CI tasks)
- `default.toml` - Common, core settings. These are over-ridden by the platform specific configurations.
- `development.toml`, etc. - The platform specific configurations to use. These will eventually be replaced by a single, composed `platform.toml`(name TBD).

Validators for the configuration options are stored in the `./merino/configs/__init__.py` file

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(suggestion): we should also add that we use (create it locally first) a default.local.toml in merino/configs which is used by the locally running instance of merino via make dev.

default.local.toml is where we store the actual api keys for vendors like accuweather / polygon e.t.c to test against the live endpoints from our local machine. Configs in this file will override configs from default.toml. And, this config file is in the .gitignore so it'll prevent you from committing secrets 😄

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, Cool! I didn't know about that. Thanks!

#### Curated Recommendations

Provides the set of Curated items on the New Tab page. (Probably don't want to go there, ask for help.)

#### Governance

Provides a set of "Circuit breakers" to interrupt long running or over burdensome processes

#### Jobs

This is the set of Merino Jobs that can be either run via cron, or as singletons. See [_Merino Jobs_](#jobs) above.

#### Middleware

The set of functions called on every request/response by the Merino system.

#### Providers

This is where many of the Merino provider APIs are defined. Things often blend between Web and Jobs, so it can be confusing to sort them out.

<a name="manifest" />

##### Manifest

A `Manifest` in this context is the site metadata associated with a given provider. This metadata can include things like the site icon, description, weight, and other data elements (_**TODO**_: Need to understand this data better).

Metadata is generally fetched from the site by a `job`, which may call a `Provider._fetch_manifest()` method to create and upload the data to a GCS bucket. This can be wrapped by the `merino.providers.manifest.backends.protocol.ManifestBackend.fetch()` If needed later by Merino web services, that bucket will be read and the Manifest data used to construct the `Suggestion`.

`Manifest`s contain a list of `Domain`s and a list of partner dictionaries.

`Domain`s are:

- **rank**: unique numeric ranking for this item.
- **domain**: the host domain without extension (e.g. for `example.com` the domain would be `example`)
- **categories**: a list of business categories for this domain (**TODO**: where are these defined?)
- **url**: the main site URL
- **title**: site title or brief description
- **icon**: URL to the icon stored in CDN
- **serp_categories**: list of numeric category codes (defined by `merino.providers.suggest.base.Category`)
- **similars**: [Optional] Similar words or common misspellings.

Partners are a set of dictionaries that contain values about **TODO**: ???. The dictionaries may specify values such as:

- **"domain"**: the host name of the partner (e.g. `example.com`)
- **"url"**: preferred URL to the partner
- **"original_icon_url"**: non-cached, original source URL for the icon.
- **"gcs_icon_url"**: URL to the icon stored in CDN

It's important to note that the `Manifest` is a [Pydantic BaseModel](https://docs.pydantic.dev/latest/api/base_model/), and as such, the elements are not directly accessible.

##### Suggest

Each `Provider` has specific code relating to how the data should be fetched and displayed. Categories of providers can be gathered under a group to take advantage of python subclassing. Once created, the provider can be included in the Merino suggestion groups by updating `merino.providers.suggest.manager._create_provider()`.

See `merino.providers.skeleton` for a general use template that modules could use.

#### Scripts

Simple utility scripts that may be useful.

#### Tests

The bank of tests (Unit and Integration) to validate code changes. All code changes should include appropriate tests.
5 changes: 5 additions & 0 deletions merino/jobs/cli.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,8 @@
from merino.jobs.wikipedia_offline_uploader import wiki_offline_uploader_cmd
from merino.jobs.polygon import cli as polygon_ingestion_cmd

# Include your new jobs module here.

# NOTE: `pretty_exceptions_show_locals` argument is set to False to avoid api_key and secrets exposure.
cli = typer.Typer(no_args_is_help=True, add_completion=False, pretty_exceptions_show_locals=False)

Expand All @@ -36,6 +38,9 @@
# Add the polygon ingest subcommand
cli.add_typer(polygon_ingestion_cmd, no_args_is_help=True)

# Describe this command and link to it.
# cli.add_typer(skeleton_cmd, no_args_is_help=True)


@cli.callback()
def setup():
Expand Down
Loading
Loading