
MEDIATECH


📝 Description

This project processes public data published by various French administrations to make it available in vectorized, ready-to-use form for AI applications in the public sector. It includes scripts for downloading, processing, embedding, and inserting this data into a PostgreSQL database, and supports exporting it through various channels.

💡 Get Started

𖣘 Method 1: Airflow

Installing and configuring dependencies

  1. Run the initial deployment script:

    sudo chmod +x ./scripts/initial_deployment.sh
    ./scripts/initial_deployment.sh
  2. Set up the environment variables in a .env file based on the example in .env.example.

    The AIRFLOW_UID variable must be obtained by executing:

    echo $(id -u)

    The JWT_TOKEN variable will be obtained later through the Airflow API; just leave it empty for now.
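
For reference, a minimal .env excerpt at this stage might look like the following sketch (values are illustrative placeholders; .env.example remains the authoritative template):

    AIRFLOW_UID=1000                       # output of `id -u`
    _AIRFLOW_WWW_USER_USERNAME=airflow     # placeholder credentials
    _AIRFLOW_WWW_USER_PASSWORD=airflow
    JWT_TOKEN=                             # leave empty; filled in later via the Airflow API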

Initialize Airflow and PostgreSQL (PgVector) containers

  1. Run the containers_deployment script:

    sudo chmod +x ./scripts/containers_deployment.sh
    ./scripts/containers_deployment.sh
  2. Complete the environment variables in your .env file, based on the example in .env.example.

  3. Export .env variables:

    export $(grep -v '^#' .env | xargs)
  4. Make sure to remove the PostgreSQL (PgVector) volume:

    docker compose down -v

    ⚠️ This operation will delete all volumes!

  5. Use the Airflow API to obtain the JWT_TOKEN variable:

    curl -X 'POST' \
    'http://localhost:8080/auth/token' \
    -H 'Content-Type: application/json' \
    -d "{\"username\": \"${_AIRFLOW_WWW_USER_USERNAME}\", \"password\": \"${_AIRFLOW_WWW_USER_PASSWORD}\"}"
  6. Define the JWT_TOKEN variable in the .env file with the obtained access_token (a scripted variant of steps 5 and 6 is sketched after this list).

  7. Define the full_pipeline_schedule Airflow variable to set the execution schedule for the full_pipeline DAG, either:

  • By executing the bash command:
    docker exec -it airflow-scheduler airflow variables set full_pipeline_schedule "0 19 * * 5"

    The cron expression "0 19 * * 5" schedules the DAG to run every Friday at 19:00 (7:00 PM). Replace the cron expression with your desired schedule or None.

  • From the Airflow UI: Admin > Variables > + Add Variable
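
As a convenience, steps 5 and 6 can be combined in a small shell sketch, assuming jq is installed; it reads the access_token field of the API response:

    # Sketch: fetch a token and print the line to copy into .env (assumes jq)
    JWT_TOKEN=$(curl -s -X POST 'http://localhost:8080/auth/token' \
      -H 'Content-Type: application/json' \
      -d "{\"username\": \"${_AIRFLOW_WWW_USER_USERNAME}\", \"password\": \"${_AIRFLOW_WWW_USER_PASSWORD}\"}" \
      | jq -r '.access_token')
    echo "JWT_TOKEN=${JWT_TOKEN}"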

Optional: Configure Tchap logging

To receive real-time notifications about DAG execution (start, success, failure) in a Tchap room, you need to configure an Apprise connection in Airflow.

If you don't want these notifications, simply remove the following lines from each DAG located in airflow_config/dags/:

  on_execute_callback=get_start_notifier(),
  on_success_callback=get_success_notifier(),
  on_failure_callback=get_failure_notifier(),

Otherwise:

  1. Navigate to the Airflow UI (usually http://localhost:8080).

  2. Go to Admin > Connections.

  3. Click the + icon to add a new record.

  4. Fill in the connection form with the following details:

    • Connection Id: TchapNotifier
    • Connection Type: Apprise
    • Extra fields > config: Construct the Apprise URL for Matrix using your environment variables, following this format:
      {"path": "matrixs://<TCHAP_ACCOUNT_TOKEN>@<TCHAP_SERVER>/<TCHAP_ROOM_TOKEN>/?format=markdown", "tag": "alerts"}
      
      • Replace <TCHAP_ACCOUNT_TOKEN> with the value from your .env file.
      • Replace <TCHAP_SERVER> with the server hostname from your .env file (e.g., matrix.agent.dinum.tchap.gouv.fr, without the https:// prefix).
      • Replace <TCHAP_ROOM_TOKEN> with the room ID from your .env file.
  5. Click Save.

Airflow will now use this connection to send formatted notifications to your specified Tchap room.
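
To check the notification channel independently of Airflow, you can send a test message with the same Apprise URL from the command line. This is a sketch assuming the apprise CLI is installed (pip install apprise); the placeholders are the same as in the connection form above:

    # Send a test message to the Tchap room (placeholders as in the connection form)
    apprise -t "Mediatech test" -b "Hello from Apprise" \
      "matrixs://<TCHAP_ACCOUNT_TOKEN>@<TCHAP_SERVER>/<TCHAP_ROOM_TOKEN>/?format=markdown"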

Downloading, Processing and Uploading Data

You are now ready to use Airflow and execute the available DAGs. Each dataset has its own DAG, and the FULL_PIPELINE DAG manages all dataset DAGs and their execution order.
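
For example, you can trigger a manual run from the scheduler container with the standard Airflow CLI (a sketch; the DAG id is assumed to be full_pipeline, matching the schedule variable above):

    docker exec -it airflow-scheduler airflow dags trigger full_pipeline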

</> Method 2: Use the local CLI

Installing Dependencies

  1. Install the required apt dependencies:

    sudo apt-get update
    sudo apt-get install -y $(cat config/requirements-apt-container.txt)
  2. Create and activate a virtual environment:

    python3 -m venv .venv  # Create the virtual environment
    source .venv/bin/activate  # Activate the virtual environment
  3. Install the required Python dependencies:

    pip install -e .

Installing in development mode (-e) allows you to use the mediatech command and modify the code without reinstalling.

Note: Make sure your environment is properly configured before continuing.

PostgreSQL (PgVector) Database Configuration

  1. Set up the environment variables in a .env file based on the example in .env.example.

  2. Export .env variables:

    export $(grep -v '^#' .env | xargs)
  3. Start the PostgreSQL container with Docker:

    docker compose up -d postgres
  4. Check that the pgvector_container container is running:

    docker ps
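
If many containers are running, you can narrow the check with Docker's name filter (the container name as given above):

    docker ps --filter "name=pgvector_container"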

Downloading, Processing and Uploading Data

Using the mediatech Command

After installation, the mediatech command is available globally and replaces python main.py.

If you encounter issues with the mediatech command, you can still use python main.py instead.

The main.py file is the main entry point of the project and provides a command-line interface (CLI) to run each step of the pipeline separately.
You can use it as follows:

mediatech <command> [options]

or

python main.py <command> [options]

Command examples:

  • View help:
    mediatech --help
  • Create PostgreSQL tables:
    mediatech create_tables --model BAAI/bge-m3
  • Download all files listed in data_config.json:
    mediatech download_files --all
  • Download files from the service_public source:
    mediatech download_files --source service_public
  • Download and process all files listed in data_config.json:
    mediatech download_and_process_files --all --model BAAI/bge-m3
  • Process all data:
    mediatech process_files --all --model BAAI/bge-m3
  • Split a table into subtables based on different criteria (see main.py):
    mediatech split_table --source legi
  • Export PostgreSQL tables to parquet files:
    mediatech export_tables --output data/parquet
  • Upload parquet datasets to the Hugging Face repository:
    mediatech upload_dataset --input data/parquet/service_public.parquet --dataset-name service-public

Run mediatech --help in your terminal to see all available options, or check the code directly in main.py.

Alternative Usage with python main.py

If you prefer to use the Python script directly, you can always use:

python main.py <command> [options]

Examples:

python main.py download_files
python main.py create_tables --model BAAI/bge-m3
python main.py process_files --all --model BAAI/bge-m3
Using the update.sh Script

The update.sh script allows you to run the entire data processing pipeline: downloading, table creation, vectorization, and export.
To run it, execute the following command from the project root:

./scripts/update.sh

This script will:

  • Wait for the PostgreSQL database to be available,
  • Create or update the necessary tables in the PostgreSQL database,
  • Download public files listed in data_config.json,
  • Process and vectorize the data,
  • Export the tables in Parquet format,
  • Upload the Parquet files to Hugging Face.
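
For reference, these stages correspond roughly to the CLI commands described above; a hand-run equivalent might look like this sketch (arguments are illustrative, taken from the examples earlier in this README):

    # Approximate manual equivalent of scripts/update.sh (illustrative arguments)
    mediatech create_tables --model BAAI/bge-m3
    mediatech download_files --all
    mediatech process_files --all --model BAAI/bge-m3
    mediatech export_tables --output data/parquet
    mediatech upload_dataset --input data/parquet/service_public.parquet --dataset-name service-public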

🗂️ Project Structure

  • main.py: Main entry point to run the complete pipeline via CLI.
  • pyproject.toml: Python project and dependency configuration.
  • Dockerfile: Defines the instructions to build the custom Docker image for Airflow, installing system dependencies, Python packages, and setting up the project environment.
  • docker-compose.yml: Orchestrates the multi-container setup, defining Airflow services and the PostgreSQL (PgVector) database.
  • .github/: Contains GitHub Actions workflows for Continuous Integration and Continuous Deployment (CI/CD), automating testing and deployment processes.
  • download_and_processing/: Contains scripts to download and extract files.
  • database/: Contains scripts to manage the database (table creation, data insertion).
  • docs/: Contains various documentation resources and tutorials.
  • utils/: Contains utility functions shared across modules.
  • config/: Contains project configuration scripts.
  • logs/: Contains log files to track script execution.
  • scripts/: Contains all shell scripts, executed either automatically or manually in some cases.
  • airflow_config/: Contains all files related to Apache Airflow, including DAG definitions (dags/), configuration (config/), logs (logs/), and plugins (plugins/). This is where the data orchestration pipelines are defined and managed.

⚖️ License

This project is licensed under the MIT License.
