docs: Improve docs for new contributors #1038
Open
jrconlin wants to merge 19 commits into main from docs/DISCO-3666_startup_guide
Commits
7d40057
docs: Improve docs for new contributors
jrconlin bfe1cf3
f nightly
jrconlin d6c454a
f nightly / pick up spares
jrconlin a10e11b
f describe manifest structure and content.
jrconlin b16b9db
Merge branch 'main' of github.com:mozilla-services/merino-py into doc…
jrconlin 2108566
f fix typing
jrconlin 6421d1e
f fmt & ruff
jrconlin 7b1afa2
f add test, clean-up
jrconlin 4c2dc2f
f nightly (r's)
jrconlin 8903ddf
feat: Add SportsData as a a provider.
jrconlin 5c5ca31
f nightly checkin
jrconlin d63389c
f remove sportsdata cruft
jrconlin e2f1de4
Merge branch 'main' of github.com:mozilla-services/merino-py into doc…
jrconlin 6b1951e
Merge branch 'main' of github.com:mozilla-services/merino-py into doc…
jrconlin d3acdbf
f updates
jrconlin f57c350
Merge branch 'main' of github.com:mozilla-services/merino-py into doc…
jrconlin c36a32c
f update and cleanup
jrconlin f168488
f add job unit test skeleton
jrconlin 21876a4
f fix ref to provider skeleton
jrconlin
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,133 @@ | ||
| # A Gentle Guide for the New Person | ||
|
|
||
| This guide presumes that you know what [Merino](intro.md) is, are familiar with programming in Python 3.12+, and are looking to incorporate a new service. | ||
|
|
||
| ## Types of Merino Services | ||
|
|
||
| Merino has two ways to provide suggestions: _off-line_ (which uses data provided by Remote Settings and stored locally by the user agent) and _on-line_ (which provides more timely data by returning live responses to queries). | ||
|
|
||
| _Off-line_ data sets are generally smaller, since we have limited storage capacity available. These may use the [`csv_rs_uploader`](../merino/jobs/csv_rs_uploader) command. A good example of this is the [`wikipedia_offline_uploader`](../merino/jobs/wikipedia_offline_uploader) job. | ||
|
|
||
| _On-line_ data do not necessarily have the same size restrictions, but are instead constrained by time. These services should return a response in less than 200 ms. | ||
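
One way to picture the time constraint on on-line services is a hard budget around the upstream call. The sketch below is only an illustration: `slow_backend_lookup`, `query_with_budget`, and `QUERY_TIMEOUT_SECONDS` are hypothetical names, and the 0.2 second value is taken from the guideline above rather than from any Merino setting.

```python
import asyncio

# Assumed budget, taken from the "less than 200 ms" guideline; not a Merino setting.
QUERY_TIMEOUT_SECONDS = 0.2


async def slow_backend_lookup(query: str, delay: float) -> list[str]:
    """Hypothetical stand-in for a live upstream call that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return [f"suggestion for {query}"]


async def query_with_budget(query: str, delay: float) -> list[str]:
    """Return suggestions, or an empty list if the time budget is exceeded."""
    try:
        return await asyncio.wait_for(
            slow_backend_lookup(query, delay), timeout=QUERY_TIMEOUT_SECONDS
        )
    except asyncio.TimeoutError:
        # Returning nothing is preferable to holding up the whole response.
        return []
```

Calling `asyncio.run(query_with_budget("fire", 0.01))` returns the suggestion, while a 0.5 second upstream delay yields an empty list.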
|
|
||
| <a name="jobs"/> | ||
| ## Merino Jobs | ||
|
|
||
| "Jobs" are various tasks that can be executed by Merino, and are located in the `./merino/jobs` directory. These jobs are invoked by calling `uv run merino-jobs {job_name}`. Running without a `{job_name}` returns a list of available jobs that can be run. For example: | ||
|
|
||
| ```bash | ||
| > uv run merino-jobs | ||
|
|
||
| Usage: merino-jobs [OPTIONS] COMMAND [ARGS]... | ||
|
|
||
| CLI Entrypoint | ||
|
|
||
| ╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────────╮ | ||
| │ --help Show this message and exit. │ | ||
| ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ | ||
| ╭─ Commands ────────────────────────────────────────────────────────────────────────────────────────────────────────╮ | ||
| │ wikipedia-indexer Commands for indexing Wikipedia exports into Elasticsearch │ | ||
| │ navigational-suggestions Command for preparing top domain metadata for navigational suggestions │ | ||
| │ amo-rs-uploader Command for uploading AMO add-on suggestions to remote settings │ | ||
| │ csv-rs-uploader Command for uploading suggestions from a CSV file to remote settings │ | ||
| │ relevancy-csv-rs-uploader Command for uploading domain data from a CSV file to remote settings │ | ||
| │ geonames-uploader Uploads GeoNames data to remote settings │ | ||
| │ wiki-offline-uploader Command for uploading wiki suggestions │ | ||
| │ polygon-ingestion Commands to download ticker logos, upload to GCS, and generate manifest │ | ||
| ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ | ||
| ``` | ||
|
|
||
| Please note that file paths presume you are in the Project Root directory. | ||
|
|
||
| ### Ingestion | ||
|
|
||
| A significant portion of work involves fetching and normalizing data. | ||
|
|
||
| #### Code | ||
|
|
||
| ##### Jobs | ||
|
|
||
| The ingestion applications are stored under `./merino/jobs/`. Each provider has its own application, since each provider is slightly different. For consistency, we use [Typer](https://typer.tiangolo.com/tutorial/) to describe the command. | ||
|
|
||
| _**TODO**_ Having weird problems defining(?)/activating(?) Options. Not sure how they're supposed to be passed along. | ||
|
|
||
| ###### Suggest | ||
|
|
||
| Suggest operates by exposing a REST-like interface. Code is structured so that: | ||
|
|
||
| A **Provider** instantiates its service (see _initialize()_) and handles the incoming HTTP request (see _query()_). | ||
| Providers instantiate a **Backend**, which resolves individual data requests (see _query(str)_) and returns a list of `merino.providers.suggest.base.BaseSuggestion`. The Backend is also responsible for managing and updating the **Manifest** data block (see [Manifest](#manifest)) via the `fetch_manifest_data()` and `build_and_upload_manifest_file()` methods. | ||
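
The Provider/Backend split described above can be sketched roughly as follows. This is a simplified illustration, not Merino's actual classes: `Suggestion` is a hypothetical stand-in for `merino.providers.suggest.base.BaseSuggestion`, `StaticBackend` is invented for the example, and the real `query()` signatures may differ.

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Suggestion:
    """Hypothetical stand-in for merino.providers.suggest.base.BaseSuggestion."""

    title: str
    url: str


class Backend(Protocol):
    """Anything that can resolve a query string into suggestions."""

    async def query(self, query: str) -> list[Suggestion]: ...


class StaticBackend:
    """Toy backend that resolves a query against an in-memory table."""

    def __init__(self, data: dict[str, str]) -> None:
        self._data = data

    async def query(self, query: str) -> list[Suggestion]:
        return [
            Suggestion(title=title, url=url)
            for title, url in self._data.items()
            if title.startswith(query.lower())
        ]


class Provider:
    """Owns a backend, prepares it once, then answers incoming requests."""

    def __init__(self, backend: Backend) -> None:
        self.backend = backend
        self.ready = False

    async def initialize(self) -> None:
        # A real provider might warm caches or open connections here.
        self.ready = True

    async def query(self, query: str) -> list[Suggestion]:
        return await self.backend.query(query)


async def main() -> list[Suggestion]:
    provider = Provider(StaticBackend({"example": "https://example.com"}))
    await provider.initialize()
    return await provider.query("exa")


results = asyncio.run(main())
```

Keeping the lookup logic in the backend means a provider can be tested with a toy backend like this one, without reaching a live service.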
|
|
||
| ##### Configuration | ||
|
|
||
| Configurations for the ingestion processes are stored under `./merino/configs` and are sets of TOML files. These include: | ||
|
|
||
| - `ci.toml` - Continuous Integration configurations (Use only for CI tasks) | ||
| - `default.toml` - Common, core settings. These are overridden by the platform-specific configurations. | ||
| - `development.toml`, etc. - The platform-specific configurations to use. These will eventually be replaced by a single, composed `platform.toml` (name TBD). | ||
|
|
||
| Validators for the configuration options are stored in the `./merino/configs/__init__.py` file. | ||
|
|
||
| #### Curated Recommendations | ||
|
|
||
| Provides the set of Curated items on the New Tab page. (You probably don't want to go there; ask for help.) | ||
|
|
||
| #### Governance | ||
|
|
||
| Provides a set of "Circuit breakers" to interrupt long-running or overly burdensome processes. | ||
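
For readers unfamiliar with the pattern, the following is a generic circuit breaker sketch. It is not the Governance module's implementation; `CircuitBreaker`, `CircuitOpenError`, and the threshold and timeout parameters are hypothetical names chosen for the example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is refused because the breaker is open."""


class CircuitBreaker:
    """Stop calling a failing function until a recovery period has passed."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock  # injectable so the behavior can be tested
        self._failures = 0
        self._opened_at = None

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit is open; call skipped")
            # Recovery period elapsed: let one trial call through.
            self._opened_at = None
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # A success closes the circuit and clears the failure count.
        self._failures = 0
        return result
```

Once the failure threshold is reached, further calls fail fast with `CircuitOpenError` instead of adding load to a struggling upstream.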
|
|
||
| #### Jobs | ||
|
|
||
| This is the set of Merino Jobs that can be run either via cron or as singletons. See [_Merino Jobs_](#jobs) above. | ||
|
|
||
| #### Middleware | ||
|
|
||
| The set of functions called on every request/response by the Merino system. | ||
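
As a rough picture of what "called on every request/response" means, here is a minimal timing middleware written against the plain ASGI calling convention. It assumes an ASGI-style application and is not taken from Merino's middleware; `TimingMiddleware`, `hello_app`, and `demo` are hypothetical names.

```python
import asyncio
import time


class TimingMiddleware:
    """Wrap an ASGI app and record how long each HTTP request takes."""

    def __init__(self, app, log: list) -> None:
        self.app = app
        self.log = log

    async def __call__(self, scope, receive, send) -> None:
        if scope["type"] != "http":
            # Pass non-HTTP traffic (e.g. lifespan events) straight through.
            await self.app(scope, receive, send)
            return
        started = time.monotonic()
        try:
            await self.app(scope, receive, send)
        finally:
            self.log.append((scope["path"], time.monotonic() - started))


async def hello_app(scope, receive, send) -> None:
    """Tiny ASGI app used only to exercise the middleware."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def demo():
    timings, sent = [], []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    app = TimingMiddleware(hello_app, timings)
    await app({"type": "http", "path": "/api/v1/suggest"}, receive, send)
    return timings, sent


timings, sent = asyncio.run(demo())
```

Because the middleware wraps the application itself, every request passes through it without each provider needing to opt in.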
|
|
||
| #### Providers | ||
|
|
||
| This is where many of the Merino provider APIs are defined. Things often blend between Web and Jobs, so it can be confusing to sort them out. | ||
|
|
||
| <a name="manifest" /> | ||
|
|
||
| ##### Manifest | ||
|
|
||
| A `Manifest` in this context is the site metadata associated with a given provider. This metadata can include things like the site icon, description, weight, and other data elements (_**TODO**_: Need to understand this data better). | ||
|
|
||
| Metadata is generally fetched from the site by a `job`, which may call a `Provider._fetch_manifest()` method to create and upload the data to a GCS bucket. This can be wrapped by the `merino.providers.manifest.backends.protocol.ManifestBackend.fetch()` method. If needed later by the Merino web services, that bucket is read and the Manifest data is used to construct the `Suggestion`. | ||
|
|
||
| `Manifest`s contain a list of `Domain`s and a list of partner dictionaries. | ||
|
|
||
| `Domain`s are: | ||
|
|
||
| - **rank**: unique numeric ranking for this item. | ||
| - **domain**: the host domain without extension (e.g. for `example.com` the domain would be `example`) | ||
| - **categories**: a list of business categories for this domain (**TODO**: where are these defined?) | ||
| - **url**: the main site URL | ||
| - **title**: site title or brief description | ||
| - **icon**: URL to the icon stored in CDN | ||
| - **serp_categories**: list of numeric category codes (defined by `merino.providers.suggest.base.Category`) | ||
| - **similars**: [Optional] Similar words or common misspellings. | ||
|
|
||
| Partners are a set of dictionaries that contain values about **TODO**: ???. The dictionaries may specify values such as: | ||
|
|
||
| - **"domain"**: the host name of the partner (e.g. `example.com`) | ||
| - **"url"**: preferred URL to the partner | ||
| - **"original_icon_url"**: non-cached, original source URL for the icon. | ||
| - **"gcs_icon_url"**: URL to the icon stored in CDN | ||
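
The shape described by the two lists above can be sketched as follows. Standard-library dataclasses are used here as a stand-in for the real Pydantic models, the field types are assumptions inferred from the descriptions, and every value (including the `cdn.example` URLs and the category name) is made up.

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    """One ranked site entry; field types are assumed from the list above."""

    rank: int
    domain: str  # host without extension, e.g. "example" for example.com
    categories: list[str]
    url: str
    title: str
    icon: str
    serp_categories: list[int]
    similars: list[str] = field(default_factory=list)  # optional


@dataclass
class Manifest:
    """A list of domains plus a list of partner dictionaries."""

    domains: list[Domain] = field(default_factory=list)
    partners: list[dict[str, str]] = field(default_factory=list)


manifest = Manifest(
    domains=[
        Domain(
            rank=1,
            domain="example",
            categories=["Technology"],
            url="https://example.com",
            title="Example",
            icon="https://cdn.example/icons/example.png",
            serp_categories=[0],
            similars=["exmaple"],
        )
    ],
    partners=[
        {
            "domain": "example.com",
            "url": "https://example.com",
            "original_icon_url": "https://example.com/favicon.ico",
            "gcs_icon_url": "https://cdn.example/icons/example.png",
        }
    ],
)
```

Note the asymmetry: a `Domain` is a typed object read through attributes (`manifest.domains[0].domain`), while each partner is a plain dictionary read by key (`manifest.partners[0]["domain"]`).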
|
|
||
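| It's important to note that the `Manifest` is a [Pydantic BaseModel](https://docs.pydantic.dev/latest/api/base_model/), and as such, its elements are not accessible by dictionary-style subscripting. Read them as attributes, or convert the model to a dictionary first with `model_dump()`. | ||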
|
|
||
| ##### Suggest | ||
|
|
||
| Each `Provider` has specific code relating to how the data should be fetched and displayed. Categories of providers can be gathered under a group to take advantage of Python subclassing. Once created, the provider can be included in the Merino suggestion groups by updating `merino.providers.suggest.manager._create_provider()`. | ||
|
|
||
| See `merino.providers.skeleton` for a general-use template that new modules can follow. | ||
|
|
||
| #### Scripts | ||
|
|
||
| Simple utility scripts that may be useful. | ||
|
|
||
| #### Tests | ||
|
|
||
| The bank of tests (Unit and Integration) to validate code changes. All code changes should include appropriate tests. | ||
(suggestion): we should also add that we use a `default.local.toml` in `merino/configs` (create it locally first), which is used by the locally running instance of Merino via `make dev`. `default.local.toml` is where we store the actual API keys for vendors like AccuWeather, Polygon, etc. to test against the live endpoints from our local machine. Configs in this file will override configs from `default.toml`. And this config file is in the `.gitignore`, so it'll prevent you from committing secrets 😄
Oh, Cool! I didn't know about that. Thanks!