TBD
We use MLFlow to centralize all training performance metrics. To reproduce our examples in your MLFlow store, use the following instructions:
- Build the database (remove `--max_pages 20` if you want to build the whole database):

  ```shell
  python run_build_database.py --max_pages 20 --experiment_name "BUILD_CHROMA_TEST"
  ```

- Evaluate model performance. If MLFlow has been used in the previous example and you know the run id (see below for cases where you don't know it), you can use the following:

  ```shell
  python run_evaluation.py --experiment_name BUILD_CHROMA_TEST --config_mlflow ${your_mlflow_run_id_here}
  ```
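If you do not know the run id, one way to look it up is with the MLFlow Python client. The following is a minimal sketch, assuming the `BUILD_CHROMA_TEST` experiment from the build step and a tracking URI already configured (for example through the `MLFLOW_TRACKING_URI` environment variable):

```python
# Sketch: retrieve the most recent run id of the BUILD_CHROMA_TEST experiment.
import mlflow

runs = mlflow.search_runs(
    experiment_names=["BUILD_CHROMA_TEST"],
    order_by=["start_time DESC"],
    max_results=1,
)
# `runs` is a pandas DataFrame; it is empty if the experiment has no runs yet.
print(runs.loc[0, "run_id"])
```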
The project parameters can be specified in the following ways. Each overrides the ones before (see the sketch after this list).

- Default values are read from `src/config/rag.toml`.
- A custom config file can be provided using the `RAG_CONFIG_FILE` environment variable:

  ```shell
  RAG_CONFIG_FILE=/home/onyxia/work/myconfig.toml python XXX.py
  ```

- Environment variables prefixed with `RAG_` are interpreted as parameters:

  ```shell
  RAG_LLM_TEMPERATURE=0.3 python XXX.py
  ```

- A custom config file can also be provided using the `--config_file` command line argument:

  ```shell
  python XXX.py --config_file /home/onyxia/work/myconfig.toml
  ```

- Some parameters can be specified using command line arguments. Depending on the application, not all parameters may be specified this way; refer to `python XXX.py --help` to know which flags are available.

  ```shell
  python XXX.py --emb_device cuda
  ```

- Finally, parameters will be loaded from a previous MLFlow run if `mlflow_run_id` and `mlflow_tracking_uri` are both properly set using any combination of the previous methods. These loaded parameters override all otherwise specified values.

  ```shell
  RAG_MLFLOW_TRACKING_URI=ZZZZZZ python XXX.py --rag_mlflow_run_id YYYY
  ```
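For illustration, the resolution order above roughly corresponds to the sketch below. It is a simplified, hypothetical reconstruction (flat parameter names, no validation), not the project's actual configuration code.

```python
# Hypothetical sketch of the override order described above -- not the
# project's actual configuration code.
import os
import tomllib  # standard library, Python >= 3.11


def _load_toml(path: str) -> dict:
    with open(path, "rb") as f:
        return tomllib.load(f)


def resolve_parameters(cli_args: dict) -> dict:
    # 1. Defaults shipped with the project.
    params = _load_toml("src/config/rag.toml")

    # 2. Optional config file pointed to by the RAG_CONFIG_FILE variable.
    if os.environ.get("RAG_CONFIG_FILE"):
        params.update(_load_toml(os.environ["RAG_CONFIG_FILE"]))

    # 3. RAG_-prefixed environment variables, e.g. RAG_LLM_TEMPERATURE=0.3.
    for key, value in os.environ.items():
        if key.startswith("RAG_") and key != "RAG_CONFIG_FILE":
            params[key.removeprefix("RAG_").lower()] = value

    # 4. Config file given on the command line, then 5. individual flags.
    if cli_args.get("config_file"):
        params.update(_load_toml(cli_args["config_file"]))
    params.update({k: v for k, v in cli_args.items() if v is not None})

    # 6. Parameters logged in a previous MLFlow run override everything else.
    if params.get("mlflow_run_id") and params.get("mlflow_tracking_uri"):
        import mlflow

        mlflow.set_tracking_uri(params["mlflow_tracking_uri"])
        params.update(mlflow.get_run(params["mlflow_run_id"]).data.params)

    return params
```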
Build the complete INSEE dataset from the parquet files stored in the S3 bucket (requires S3 credentials and SSP Cloud access):
```shell
cd llm-open-data-insee
pip install -r requirements.txt
pre-commit install

python src/db_building/insee_data_processing.py

mc cp -r s3/projet-llm-insee-open-data/data/chroma_database/chroma_db/ data/chroma_db
```
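After copying the database locally, you can sanity-check that it opens correctly. This is a minimal sketch, assuming the `chromadb` Python package and the `data/chroma_db` path used in the command above:

```python
# Sketch: open the copied Chroma database and list its collections.
import chromadb

client = chromadb.PersistentClient(path="data/chroma_db")
# Depending on the chromadb version this prints collection objects or names.
print(client.list_collections())
```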