2 changes: 2 additions & 0 deletions docs/media/domain-detail.dark.svg
2 changes: 2 additions & 0 deletions docs/media/domain-detail.light.svg
31 changes: 31 additions & 0 deletions docs/media/full.dark.svg
31 changes: 31 additions & 0 deletions docs/media/full.light.svg
4 changes: 4 additions & 0 deletions docs/media/generators.dark.svg
4 changes: 4 additions & 0 deletions docs/media/generators.light.svg
2 changes: 2 additions & 0 deletions docs/media/instant.dark.svg
2 changes: 2 additions & 0 deletions docs/media/instant.light.svg
13 changes: 13 additions & 0 deletions docs/media/logo-light.svg
202 changes: 202 additions & 0 deletions docs/old_readme.md
![Tests](https://github.com/namehash/namegraph/actions/workflows/ci.yml/badge.svg?branch=master)

# Install

```
pip3 install -e .
```

# Usage

The application can be run:
* reading queries from stdin
* reading a query passed as an argument

Additional resources need to be downloaded:
```
python download.py # dictionaries, embeddings
python download_names.py
```

## Queries from stdin

```
python ./generator/app.py app.input=stdin
```

## Query as an argument

The application can be run with

```
python ./generator/app.py
```

It will generate suggestions for the default query.

The default parameters are defined in `conf/config.yaml`. Any parameter can be overridden on the command line by referencing its dot-separated path, e.g.

```
python ./generator/app.py app.query=firepower
```

will substitute the default query with the provided one.

The parameters are documented in the config.

# REST API

Start server:
```
python -m uvicorn web_api:app --reload
```

Query with POST:
```
curl -d '{"label":"fire"}' -H "Content-Type: application/json" -X POST http://localhost:8000
```
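
The same request can also be sent from Python, e.g. with `requests` (a minimal sketch, assuming the server is running locally on port 8000 as started above):

```python
import requests

# POST a label to the locally running name-generator service.
response = requests.post(
    "http://localhost:8000",
    json={"label": "fire"},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # the generated suggestions
```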

# Tests

Run:
```
pytest
```
or without slow tests:
```
pytest -m "not slow"
```

## Debugging

Run app with `app.logging_level=DEBUG` to see debug information:
```
python generator/app.py app.input=stdin app.logging_level=DEBUG
```

# Deployment

## Build Docker image locally

Set image TAG:

`export TAG=0.1.0`

Build a Docker image locally:

`docker compose -f docker-compose.build.yml build`

Authenticate to AWS (if you are using MFA, you have to obtain temporary access keys from AWS STS):

`aws configure`

Authorize to ECR:

`./authorize-ecr.sh`

Push image to ECR:

`docker push 571094861812.dkr.ecr.us-east-1.amazonaws.com/name-generator:${TAG}`

## Deploy image on remote instance

Set image TAG:

`export TAG=0.1.0`

Authenticate the EC2 instance to ECR:

`aws ecr get-login-password | docker login --username AWS --password-stdin 571094861812.dkr.ecr.us-east-1.amazonaws.com/name-generator`

Deploy (or re-deploy) the image:

`docker compose up -d`

Check that the service works:

`curl -d '{"label":"firestarter"}' -H "Content-Type: application/json" -X POST http://44.203.61.202`

## Learning-To-Rank

To use the LTR features, you need to configure them in the Elasticsearch instance (see [here](https://github.com/namehash/collection-templates/tree/master/research/learning-to-rank/readme.md) for more details).

## Pipelines, weights, sampler

`conf/prod_config_new.yaml` defines `generator_limits`, which caps the maximum number of suggestions generated by each generator. This is an optimization. E.g.:
```yaml
generator_limits:
HyphenGenerator: 128
AbbreviationGenerator: 128
EmojiGenerator: 150
Wikipedia2VGenerator: 100
RandomAvailableNameGenerator: 20000
```
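
Conceptually, a generator limit just caps how many suggestions are drawn from a (possibly lazy) generator. A minimal illustrative sketch, not the project's implementation:

```python
from itertools import islice


def apply_generator_limit(suggestions, limit):
    """Take at most `limit` suggestions from a possibly very large generator."""
    return islice(suggestions, limit)


# E.g. with the HyphenGenerator limit of 128 from the config above:
capped = list(apply_generator_limit(iter(["fire-power", "fire-storm"]), 128))
```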

Pipelines are defined in `conf/pipelines/prod_new.yaml`. Each pipeline has:
* a `name`
* one `generator`
* a list of `filters`, e.g. SubnameFilter, ValidNameFilter, ValidNameLengthFilter, DomainFilter
* `weights` for each interpretation type (`ngram`, `person`, `other`) and each language
* `mode_weights_multiplier` - a multiplier applied to the above weights for each mode (e.g. `instant`, `domain_detail`, `full`)
* `global_limits` for each mode, which can be an integer (absolute number) or a float (fraction of all results); values for the `grouped_by_category` endpoint can be overridden by adding the `grouped_` prefix (e.g. `grouped_instant`, `grouped_domain_detail`, `grouped_full`)

Setting `0` in `mode_weights_multiplier` or `global_limits` disables the pipeline in a given mode.
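
The exact resolution logic lives in the generator code; the following is only a minimal Python sketch of how the values above could combine (function names such as `effective_weight` and `resolve_global_limit` are illustrative, not the project's API):

```python
from typing import Union


def effective_weight(weights: dict, mode_weights_multiplier: dict,
                     interpretation_type: str, lang: str, mode: str) -> float:
    """Pipeline weight for one (type, lang) pair in a given mode.

    A multiplier of 0 disables the pipeline in that mode.
    """
    return weights[interpretation_type][lang] * mode_weights_multiplier.get(mode, 1.0)


def resolve_global_limit(limit: Union[int, float], total_results: int) -> int:
    """An int is an absolute cap; a float is treated as a fraction of all results."""
    if isinstance(limit, float):
        return int(limit * total_results)
    return int(limit)
```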

### Sampler

Each request defines:
* `mode`
* `min_suggestions`
* `max_suggestions`
* `min_available_fraction`

A name can have many interpretations. Every interpretation has a type (`ngram`, `person`, `other`), a language, and a probability. There might be more than one interpretation with the same type and language.

For each pair of type and language, the probability of each pipeline is computed.

1. If there are enough suggestions, then break.
2. If all pipeline probabilities for every pair of type and language are 0, then break.
3. Sample a type and language, then sample an interpretation within this type and language.
4. Sample a pipeline for the sampled interpretation. The first pass of sampling is without replacement to increase diversity in the top suggestions.
5. If the pipeline has exceeded its global limit, then go to 4.
6. Get a suggestion from the pipeline (the generator is executed here). If there are no more suggestions, then go to 4.
7. If the suggestion has already been sampled, then go to 6.
8. If the suggestion is not available and only available suggestions can still be added (to satisfy `min_available_fraction`), then go to 6.
9. If the suggestion is not normalized, then go to 6.
10. Go to 1.

Exhausted pipelines are removed from sampling.
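
A minimal, illustrative Python sketch of the loop above (the real sampler lives in the generator code; the pipeline, interpretation and suggestion attributes used here are assumptions, and the without-replacement first pass and exact retry targets are simplified):

```python
import math
import random


def sample_suggestions(pipelines, interpretations, max_suggestions, min_available_fraction):
    """Illustrative sketch of the sampling loop described above.

    Assumed (hypothetical) interfaces: a pipeline has `exhausted`, `emitted`, `global_limit`,
    `weight(interpretation)` and `next_suggestion(interpretation)`; an interpretation has a
    `probability`; a suggestion has `label`, `available` and `normalized`.
    """
    suggestions, seen = [], set()
    min_available = math.ceil(min_available_fraction * max_suggestions)
    while len(suggestions) < max_suggestions:                                         # step 1
        active = [p for p in pipelines if not p.exhausted and p.emitted < p.global_limit]
        if not active or all(p.weight(i) == 0 for p in active for i in interpretations):  # step 2
            break
        interpretation = random.choices(
            interpretations, weights=[i.probability for i in interpretations])[0]     # step 3
        pipeline_weights = [p.weight(interpretation) for p in active]
        if sum(pipeline_weights) == 0:
            continue                                     # no pipeline for this interpretation
        pipeline = random.choices(active, weights=pipeline_weights)[0]                # steps 4-5 (simplified)
        suggestion = pipeline.next_suggestion(interpretation)                         # step 6
        if suggestion is None:
            continue  # pipeline ran dry; it is assumed to mark itself as exhausted
        if suggestion.label in seen:                                                  # step 7
            continue
        available_so_far = sum(1 for s in suggestions if s.available)
        remaining = max_suggestions - len(suggestions)
        if not suggestion.available and min_available - available_so_far >= remaining:  # step 8
            continue
        if not suggestion.normalized:                                                 # step 9
            continue
        seen.add(suggestion.label)
        suggestions.append(suggestion)                                                # step 10: loop again
    return suggestions
```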

### Grouped by category

Parameters:
* `mode`
* `min_available_fraction`
* max number of categories
* max number of suggestions per category
* max related categories
* min total categories?
* max total categories?

Requirements:
* order of categories is fixed
* every generator must be mapped to only one category
* flag generator suggestions should appear in 10% of suggestions - maybe we should detect whether it is a user's first search
* should we remove first pass of sampling with every generator?

1. Shuffle the order of categories (using weights?) if the min number of categories is smaller than the number of all categories. If some category does not return suggestions, then we take the next one.
2. Within each category: sample the type and language of the interpretation, sample an interpretation with this type and language, then sample a pipeline (pipeline weights depend on type and language). Do it in parallel?
3. Sample up to the max number of suggestions per category. How to handle `min_available_fraction`?
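
Since several of the points above are still open questions, the following is only a rough Python sketch of steps 1-3 as written (the `sample_in_category` callback stands in for the per-category sampling and is an assumption, not the project's API):

```python
import random


def grouped_by_category(categories, max_categories, max_per_category, sample_in_category):
    """Rough sketch of steps 1-3; weighting, parallelism and min_available_fraction are left open."""
    if max_categories < len(categories):
        categories = random.sample(categories, k=len(categories))        # step 1: shuffle (weights TBD)
    grouped = {}
    for category in categories:
        if len(grouped) >= max_categories:
            break
        suggestions = sample_in_category(category, max_per_category)     # step 2: type/lang -> interpretation -> pipeline
        if not suggestions:                                              # empty category: take the next one
            continue
        grouped[category] = suggestions[:max_per_category]               # step 3
    return grouped
```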

### Suggestions by category

For each category, a MetaSampler is created with the appropriate pipelines.
All MetaSamplers are executed in parallel. Within one MetaSampler:
1. Apply global limits.
2. For each interpretation (interpretation_type, lang, tokenization) a sampler is created.


After generating suggestions for all categories:
1. For each category, the number of suggestions is limited by the category's `max_suggestions`.
2. If `count_real_suggestions` < `min_total_suggestions`, then RandomAvailable names are appended as the `other` category.
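
A condensed Python sketch of this flow (the `MetaSampler` interface, the category attributes and the `random_available` helper below are assumptions for illustration only):

```python
from concurrent.futures import ThreadPoolExecutor


def suggestions_by_category(categories, min_total_suggestions, random_available):
    """Sketch: one MetaSampler per category, executed in parallel, then post-processed."""
    # Each category's MetaSampler applies global limits and creates one sampler per
    # interpretation (interpretation_type, lang, tokenization), as described above.
    with ThreadPoolExecutor() as executor:
        results = list(executor.map(lambda c: (c, c.meta_sampler.sample()), categories))

    # Post-processing after all categories have generated their suggestions.
    grouped = {c.name: suggestions[:c.max_suggestions] for c, suggestions in results}  # 1.
    count_real_suggestions = sum(len(s) for s in grouped.values())
    if count_real_suggestions < min_total_suggestions:                                  # 2.
        missing = min_total_suggestions - count_real_suggestions
        grouped.setdefault("other", []).extend(random_available(missing))
    return grouped
```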