2 changes: 2 additions & 0 deletions docs/media/domain-detail.dark.svg
2 changes: 2 additions & 0 deletions docs/media/domain-detail.light.svg
31 changes: 31 additions & 0 deletions docs/media/full.dark.svg
31 changes: 31 additions & 0 deletions docs/media/full.light.svg
4 changes: 4 additions & 0 deletions docs/media/generators.dark.svg
4 changes: 4 additions & 0 deletions docs/media/generators.light.svg
2 changes: 2 additions & 0 deletions docs/media/instant.dark.svg
2 changes: 2 additions & 0 deletions docs/media/instant.light.svg
13 changes: 13 additions & 0 deletions docs/media/logo-light.svg
202 changes: 202 additions & 0 deletions docs/old_readme.md
![Tests](https://github.com/namehash/namegraph/actions/workflows/ci.yml/badge.svg?branch=master)

# Install

```
pip3 install -e .
```

# Usage

The application can be run:
* reading queries from stdin
* reading a query passed as an argument

Additional resources need to be downloaded:
```
python download.py # dictionaries, embeddings
python download_names.py
```

## Queries from stdin

```
python ./generator/app.py app.input=stdin
```

## Query as an argument

The application can be run with

```
python ./generator/app.py
```

It will generate suggestions for the default query.

The default parameters are defined in `conf/config.yaml`. Any parameter can be overridden on the command line by referencing its dot-separated path, e.g.

```
python ./generator/app.py app.query=firepower
```

will substitute the default query with the provided one.

The parameters are documented in the config.

# REST API

Start server:
```
python -m uvicorn web_api:app --reload
```

Query with POST:
```
curl -d '{"label":"fire"}' -H "Content-Type: application/json" -X POST http://localhost:8000
```
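
The same request can also be sent from Python, e.g. with `requests` (a minimal sketch, assuming the server is running locally on port 8000 as started above):

```python
import requests

# POST a label to the locally running name-generator service.
response = requests.post(
    "http://localhost:8000",
    json={"label": "fire"},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # the generated suggestions
```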

# Tests

Run:
```
pytest
```
or without slow tests:
```
pytest -m "not slow"
```

## Debugging

Run app with `app.logging_level=DEBUG` to see debug information:
```
python generator/app.py app.input=stdin app.logging_level=DEBUG
```

# Deployment

## Build Docker image locally

Set image TAG:

`export TAG=0.1.0`

Build a Docker image locally:

`docker compose -f docker-compose.build.yml build`

Authenticate to AWS (if you are using MFA, you have to obtain temporary access keys from AWS STS):

`aws configure`

Authorize to ECR:

`./authorize-ecr.sh`

Push image to ECR:

`docker push 571094861812.dkr.ecr.us-east-1.amazonaws.com/name-generator:${TAG}`

## Deploy image on remote instance

Set image TAG:

`export TAG=0.1.0`

Authenticate the EC2 instance to ECR:

`aws ecr get-login-password | docker login --username AWS --password-stdin 571094861812.dkr.ecr.us-east-1.amazonaws.com/name-generator`

Deploy (or re-deploy) the image:

`docker compose up -d`

Check that the service works:

`curl -d '{"label":"firestarter"}' -H "Content-Type: application/json" -X POST http://44.203.61.202`

## Learning-To-Rank

To use the LTR features, you need to configure them in the Elasticsearch instance (see [here](https://github.com/namehash/collection-templates/tree/master/research/learning-to-rank/readme.md) for more details).

## Pipelines, weights, sampler

`conf/prod_config_new.yaml` defines `generator_limits`, which caps the maximum number of suggestions generated by each generator. This is an optimization. E.g.:
```yaml
generator_limits:
HyphenGenerator: 128
AbbreviationGenerator: 128
EmojiGenerator: 150
Wikipedia2VGenerator: 100
RandomAvailableNameGenerator: 20000
```
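
Conceptually, a generator limit just caps how many suggestions are drawn from a (possibly lazy) generator. A minimal illustrative sketch, not the project's implementation:

```python
from itertools import islice


def apply_generator_limit(suggestions, limit):
    """Take at most `limit` suggestions from a possibly very large generator."""
    return islice(suggestions, limit)


# E.g. with the HyphenGenerator limit of 128 from the config above:
capped = list(apply_generator_limit(iter(["fire-power", "fire-storm"]), 128))
```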

Pipelines are defined in `conf/pipelines/prod_new.yaml`. Each pipeline has:
* a `name`
* one `generator`
* a list of `filters`, e.g. SubnameFilter, ValidNameFilter, ValidNameLengthFilter, DomainFilter
* `weights` for each interpretation type (`ngram`, `person`, `other`) and each language
* `mode_weights_multiplier` - a multiplier applied to the above weights for each mode (e.g. `instant`, `domain_detail`, `full`)
* `global_limits` for each mode, which can be an integer (absolute number) or a float (fraction of all results); values for the `grouped_by_category` endpoint can be overridden by adding the `grouped_` prefix (e.g. `grouped_instant`, `grouped_domain_detail`, `grouped_full`)

Setting `0` in `mode_weights_multiplier` or `global_limits` disables the pipeline in a given mode.
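
The exact resolution logic lives in the generator code; the following is only a minimal Python sketch of how the values above could combine (function names such as `effective_weight` and `resolve_global_limit` are illustrative, not the project's API):

```python
from typing import Union


def effective_weight(weights: dict, mode_weights_multiplier: dict,
                     interpretation_type: str, lang: str, mode: str) -> float:
    """Pipeline weight for one (type, lang) pair in a given mode.

    A multiplier of 0 disables the pipeline in that mode.
    """
    return weights[interpretation_type][lang] * mode_weights_multiplier.get(mode, 1.0)


def resolve_global_limit(limit: Union[int, float], total_results: int) -> int:
    """An int is an absolute cap; a float is treated as a fraction of all results."""
    if isinstance(limit, float):
        return int(limit * total_results)
    return int(limit)
```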

### Sampler

Each request defines:
* `mode`
* `min_suggestions`
* `max_suggestions`
* `min_available_fraction`

A name can have many interpretations. Every interpretation has a type (`ngram`, `person`, `other`), a language, and a probability. There might be more than one interpretation with the same type and language.

For each pair of type and language, the probability of each pipeline is computed.

1. If there are enough suggestions, then break.
2. If all pipeline probabilities for every pair of type and language are 0, then break.
3. Sample a type and language, then sample an interpretation within this type and language.
4. Sample a pipeline for the sampled interpretation. The first pass of sampling is without replacement to increase diversity in the top suggestions.
5. If the pipeline has exceeded its global limit, then go to 4.
6. Get a suggestion from the pipeline (the generator is executed here). If there are no more suggestions, then go to 4.
7. If the suggestion has already been sampled, then go to 6.
8. If the suggestion is not available and only available suggestions can still be added (to satisfy `min_available_fraction`), then go to 6.
9. If the suggestion is not normalized, then go to 6.
10. Go to 1.

Exhausted pipelines are removed from sampling.
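
A minimal, illustrative Python sketch of the loop above (the real sampler lives in the generator code; the pipeline, interpretation and suggestion attributes used here are assumptions, and the without-replacement first pass and exact retry targets are simplified):

```python
import math
import random


def sample_suggestions(pipelines, interpretations, max_suggestions, min_available_fraction):
    """Illustrative sketch of the sampling loop described above.

    Assumed (hypothetical) interfaces: a pipeline has `exhausted`, `emitted`, `global_limit`,
    `weight(interpretation)` and `next_suggestion(interpretation)`; an interpretation has a
    `probability`; a suggestion has `label`, `available` and `normalized`.
    """
    suggestions, seen = [], set()
    min_available = math.ceil(min_available_fraction * max_suggestions)
    while len(suggestions) < max_suggestions:                                         # step 1
        active = [p for p in pipelines if not p.exhausted and p.emitted < p.global_limit]
        if not active or all(p.weight(i) == 0 for p in active for i in interpretations):  # step 2
            break
        interpretation = random.choices(
            interpretations, weights=[i.probability for i in interpretations])[0]     # step 3
        pipeline_weights = [p.weight(interpretation) for p in active]
        if sum(pipeline_weights) == 0:
            continue                                     # no pipeline for this interpretation
        pipeline = random.choices(active, weights=pipeline_weights)[0]                # steps 4-5 (simplified)
        suggestion = pipeline.next_suggestion(interpretation)                         # step 6
        if suggestion is None:
            continue  # pipeline ran dry; it is assumed to mark itself as exhausted
        if suggestion.label in seen:                                                  # step 7
            continue
        available_so_far = sum(1 for s in suggestions if s.available)
        remaining = max_suggestions - len(suggestions)
        if not suggestion.available and min_available - available_so_far >= remaining:  # step 8
            continue
        if not suggestion.normalized:                                                 # step 9
            continue
        seen.add(suggestion.label)
        suggestions.append(suggestion)                                                # step 10: loop again
    return suggestions
```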

### Grouped by category

Parameters:
* `mode`
* `min_available_fraction`
* max number of categories
* max number of suggestions per category
* max related categories
* min total categories?
* max total categories?

Requirements:
* order of categories is fixed
* every generator must be mapped to only one category
* flag generator suggestions should appear in 10% of suggestions - maybe we should detect whether it is a user's first search
* should we remove first pass of sampling with every generator?

1. Shuffle the order of categories (using weights?) if the min number of categories is smaller than the number of all categories. If some category does not return suggestions, then we take the next one.
2. Within each category: sample the type and language of the interpretation, sample an interpretation with this type and language, then sample a pipeline (pipeline weights depend on type and language). Do it in parallel?
3. Sample up to the max number of suggestions per category. How to handle `min_available_fraction`?
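
Since several of the points above are still open questions, the following is only a rough Python sketch of steps 1-3 as written (the `sample_in_category` callback stands in for the per-category sampling and is an assumption, not the project's API):

```python
import random


def grouped_by_category(categories, max_categories, max_per_category, sample_in_category):
    """Rough sketch of steps 1-3; weighting, parallelism and min_available_fraction are left open."""
    if max_categories < len(categories):
        categories = random.sample(categories, k=len(categories))        # step 1: shuffle (weights TBD)
    grouped = {}
    for category in categories:
        if len(grouped) >= max_categories:
            break
        suggestions = sample_in_category(category, max_per_category)     # step 2: type/lang -> interpretation -> pipeline
        if not suggestions:                                              # empty category: take the next one
            continue
        grouped[category] = suggestions[:max_per_category]               # step 3
    return grouped
```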

### Suggestions by category

For each category, a MetaSampler is created with the appropriate pipelines.
All MetaSamplers are executed in parallel. Within one MetaSampler:
1. Apply global limits.
2. For each interpretation (interpretation_type, lang, tokenization) a sampler is created.


After generating suggestions for all categories:
1. For each category, the number of suggestions is limited by the category's `max_suggestions`.
2. If `count_real_suggestions` < `min_total_suggestions`, then RandomAvailable names are appended as the `other` category.
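
A condensed Python sketch of this flow (the `MetaSampler` interface, the category attributes and the `random_available` helper below are assumptions for illustration only):

```python
from concurrent.futures import ThreadPoolExecutor


def suggestions_by_category(categories, min_total_suggestions, random_available):
    """Sketch: one MetaSampler per category, executed in parallel, then post-processed."""
    # Each category's MetaSampler applies global limits and creates one sampler per
    # interpretation (interpretation_type, lang, tokenization), as described above.
    with ThreadPoolExecutor() as executor:
        results = list(executor.map(lambda c: (c, c.meta_sampler.sample()), categories))

    # Post-processing after all categories have generated their suggestions.
    grouped = {c.name: suggestions[:c.max_suggestions] for c, suggestions in results}  # 1.
    count_real_suggestions = sum(len(s) for s in grouped.values())
    if count_real_suggestions < min_total_suggestions:                                  # 2.
        missing = min_total_suggestions - count_real_suggestions
        grouped.setdefault("other", []).extend(random_available(missing))
    return grouped
```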