
MEDIATECH


📝 Description

This project processes public data published by various French administrations to make it available in vectorized, ready-to-use form for AI applications in the public sector. It includes scripts for downloading, processing, embedding, and inserting this data into a PostgreSQL database, and supports exporting it through various channels.

💡 Get Started

𖣘 Method 1: Airflow

Installing and configuring dependencies

  1. Run the initial deployment script:

    sudo chmod +x ./scripts/initial_deployment.sh
    ./scripts/initial_deployment.sh
  2. Set up the environment variables in a .env file based on the example in .env.example.

    The AIRFLOW_UID variable must be obtained by executing:

    echo $(id -u)

    The JWT_TOKEN variable will be obtained later through the Airflow API; just leave it empty for now.
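
For reference, a minimal .env excerpt at this stage might look like the following sketch (values are illustrative placeholders; .env.example remains the authoritative template):

    AIRFLOW_UID=1000                       # output of `id -u`
    _AIRFLOW_WWW_USER_USERNAME=airflow     # placeholder credentials
    _AIRFLOW_WWW_USER_PASSWORD=airflow
    JWT_TOKEN=                             # leave empty; filled in later via the Airflow API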

Initialize Airflow and PostgreSQL (PgVector) containers

  1. Run the containers_deployment script:

    sudo chmod +x ./scripts/containers_deployment.sh
    ./scripts/containers_deployment.sh
  2. Complete the environment variables in your .env file, based on the example in .env.example.

  3. Export .env variables:

    export $(grep -v '^#' .env | xargs)
  4. Make sure to remove the PostgreSQL (PgVector) volume:

    docker compose down -v

    ⚠️ This operation will delete all volumes!

  5. Use the Airflow API to obtain the JWT_TOKEN variable:

    curl -X 'POST' \
    'http://localhost:8080/auth/token' \
    -H 'Content-Type: application/json' \
    -d "{\"username\": \"${_AIRFLOW_WWW_USER_USERNAME}\", \"password\": \"${_AIRFLOW_WWW_USER_PASSWORD}\"}"
  6. Define the JWT_TOKEN variable in the .env file with the obtained access_token (a scripted variant of steps 5 and 6 is sketched after this list).

  7. Define the full_pipeline_schedule Airflow variable to set the execution schedule for the full_pipeline DAG, either:

  • By executing the bash command:
    docker exec -it airflow-scheduler airflow variables set full_pipeline_schedule "0 19 * * 5"

    The cron expression "0 19 * * 5" schedules the DAG to run every Friday at 19:00 (7:00 PM). Replace the cron expression with your desired schedule or None.

  • From the Airflow UI: Admin > Variables > + Add Variable
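
As a convenience, steps 5 and 6 can be combined in a small shell sketch, assuming jq is installed; it reads the access_token field of the API response:

    # Sketch: fetch a token and print the line to copy into .env (assumes jq)
    JWT_TOKEN=$(curl -s -X POST 'http://localhost:8080/auth/token' \
      -H 'Content-Type: application/json' \
      -d "{\"username\": \"${_AIRFLOW_WWW_USER_USERNAME}\", \"password\": \"${_AIRFLOW_WWW_USER_PASSWORD}\"}" \
      | jq -r '.access_token')
    echo "JWT_TOKEN=${JWT_TOKEN}"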

Optional: Configure Tchap logging

To receive real-time notifications about DAG execution (start, success, failure) in a Tchap room, you need to configure an Apprise connection in Airflow.

If you don't want these notifications, simply remove the following lines from each DAG located in airflow_config/dags/:

  on_execute_callback=get_start_notifier(),
  on_success_callback=get_success_notifier(),
  on_failure_callback=get_failure_notifier(),

Otherwise:

  1. Navigate to the Airflow UI (usually http://localhost:8080).

  2. Go to Admin > Connections.

  3. Click the + icon to add a new record.

  4. Fill in the connection form with the following details:

    • Connection Id: TchapNotifier
    • Connection Type: Apprise
    • Extra fields > config: Construct the Apprise URL for Matrix using your environment variables, following this format:
      {"path": "matrixs://<TCHAP_ACCOUNT_TOKEN>@<TCHAP_SERVER>/<TCHAP_ROOM_TOKEN>/?format=markdown", "tag": "alerts"}
      
      • Replace <TCHAP_ACCOUNT_TOKEN> with the value from your .env file.
      • Replace <TCHAP_SERVER> with the server hostname from your .env file (e.g., matrix.agent.dinum.tchap.gouv.fr, without the https:// prefix).
      • Replace <TCHAP_ROOM_TOKEN> with the room ID from your .env file.
  5. Click Save.

Airflow will now use this connection to send formatted notifications to your specified Tchap room.
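
To check the notification channel independently of Airflow, you can send a test message with the same Apprise URL from the command line. This is a sketch assuming the apprise CLI is installed (pip install apprise); the placeholders are the same as in the connection form above:

    # Send a test message to the Tchap room (placeholders as in the connection form)
    apprise -t "Mediatech test" -b "Hello from Apprise" \
      "matrixs://<TCHAP_ACCOUNT_TOKEN>@<TCHAP_SERVER>/<TCHAP_ROOM_TOKEN>/?format=markdown"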

Downloading, Processing and Uploading Data

You are now ready to use Airflow and execute the available DAGs. Each dataset has its own DAG, and the FULL_PIPELINE DAG manages all dataset DAGs and their execution order.
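
For example, you can trigger a manual run from the scheduler container with the standard Airflow CLI (a sketch; the DAG id is assumed to be full_pipeline, matching the schedule variable above):

    docker exec -it airflow-scheduler airflow dags trigger full_pipeline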

</> Method 2: Use the local CLI

Installing Dependencies

  1. Install the required apt dependencies:

    sudo apt-get update
    sudo apt-get install -y $(cat config/requirements-apt-container.txt)
  2. Create and activate a virtual environment:

    python3 -m venv .venv  # Create the virtual environment
    source .venv/bin/activate  # Activate the virtual environment
  3. Install the required Python dependencies:

    pip install -e .

Installing in development mode (-e) allows you to use the mediatech command and modify the code without reinstalling.

Note: Make sure your environment is properly configured before continuing.

PostgreSQL (PgVector) Database Configuration

  1. Set up the environment variables in a .env file based on the example in .env.example.

  2. Export .env variables:

    export $(grep -v '^#' .env | xargs)
  3. Start the PostgreSQL container with Docker:

    docker compose up -d postgres
  4. Check that the pgvector_container container is running:

    docker ps
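
If many containers are running, you can narrow the check with Docker's name filter (the container name as given above):

    docker ps --filter "name=pgvector_container"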

Downloading, Processing and Uploading Data

Using the mediatech Command

After installation, the mediatech command is available globally and replaces python main.py.

If you encounter issues with the mediatech command, you can still use python main.py instead.

The main.py file is the main entry point of the project and provides a command-line interface (CLI) to run each step of the pipeline separately.
You can use it as follows:

mediatech <command> [options]

or

python main.py <command> [options]

Command examples:

  • View help:
    mediatech --help
  • Create PostgreSQL tables:
    mediatech create_tables --model BAAI/bge-m3
  • Download all files listed in data_config.json:
    mediatech download_files --all
  • Download files from the service_public source:
    mediatech download_files --source service_public
  • Download and process all files listed in data_config.json:
    mediatech download_and_process_files --all --model BAAI/bge-m3
  • Process all data:
    mediatech process_files --all --model BAAI/bge-m3
  • Split a table into subtables based on different criteria (see main.py):
    mediatech split_table --source legi
  • Export PostgreSQL tables to parquet files:
    mediatech export_tables --output data/parquet
  • Upload parquet datasets to the Hugging Face repository:
    mediatech upload_dataset --input data/parquet/service_public.parquet --dataset-name service-public

Run mediatech --help in your terminal to see all available options, or check the code directly in main.py.

Alternative Usage with python main.py

If you prefer to use the Python script directly, you can always use:

python main.py <command> [options]

Examples:

python main.py download_files
python main.py create_tables --model BAAI/bge-m3
python main.py process_files --all --model BAAI/bge-m3
Using the update.sh Script

The update.sh script allows you to run the entire data processing pipeline: downloading, table creation, vectorization, and export.
To run it, execute the following command from the project root:

./scripts/update.sh

This script will:

  • Wait for the PostgreSQL database to be available,
  • Create or update the necessary tables in the PostgreSQL database,
  • Download public files listed in data_config.json,
  • Process and vectorize the data,
  • Export the tables in Parquet format,
  • Upload the Parquet files to Hugging Face.
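
For reference, these stages correspond roughly to the CLI commands described above; a hand-run equivalent might look like this sketch (arguments are illustrative, taken from the examples earlier in this README):

    # Approximate manual equivalent of scripts/update.sh (illustrative arguments)
    mediatech create_tables --model BAAI/bge-m3
    mediatech download_files --all
    mediatech process_files --all --model BAAI/bge-m3
    mediatech export_tables --output data/parquet
    mediatech upload_dataset --input data/parquet/service_public.parquet --dataset-name service-public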

🗂️ Project Structure

  • main.py: Main entry point to run the complete pipeline via CLI.
  • pyproject.toml: Python project and dependency configuration.
  • Dockerfile: Defines the instructions to build the custom Docker image for Airflow, installing system dependencies, Python packages, and setting up the project environment.
  • docker-compose.yml: Orchestrates the multi-container setup, defining Airflow services and the PostgreSQL (PgVector) database.
  • .github/: Contains GitHub Actions workflows for Continuous Integration and Continuous Deployment (CI/CD), automating testing and deployment processes.
  • download_and_processing/: Contains scripts to download and extract files.
  • database/: Contains scripts to manage the database (table creation, data insertion).
  • docs/: Contains various documentation resources and tutorials.
  • utils/: Contains utility functions shared across modules.
  • config/: Contains project configuration scripts.
  • logs/: Contains log files to track script execution.
  • scripts/: Contains all shell scripts, executed either automatically or manually in some cases.
  • airflow_config/: Contains all files related to Apache Airflow, including DAG definitions (dags/), configuration (config/), logs (logs/), and plugins (plugins/). This is where the data orchestration pipelines are defined and managed.

⚖️ License

This project is licensed under the MIT License.
