Welcome to the Ansible Log Analysis Quick Start! This system automatically detects errors, classifies them by authorization level, and generates intelligent step-by-step solutions. It eliminates manual log searching and reduces resolution time by routing issues to the appropriate experts.
- Overview
- Problem We Solve
- Current Manual Process
- Our Solution Stack
- High-Level Solution
- Agentic Workflow
- User Interface
- Annotation Interface
- Requirements
- Deployment
The Challenge: Organizations running Ansible automation at scale face significant challenges when errors occur. Log analysis is manual, time-consuming, and requires specialized knowledge across multiple domains (AWS, Kubernetes, networking, etc.). When failures happen, teams spend valuable time searching through logs, identifying the right experts, and waiting for solutions.
Our Solution: An AI-powered log analysis system that automatically:
- Detects and categorizes Ansible errors in real-time
- Routes issues to appropriate experts based on authorization levels
- Provides contextual, step-by-step solutions using AI agents
- Learns from historical resolutions to improve future recommendations
A human analyst must:
- Search for the error logs.
- Talk with the person who is authorized with the credentials to solve the problem. For example:
  - A failed AWS provisioning requires talking with the authorized AWS person.
  - A bug in the playbook source code requires talking with the programmer.
- The authorized person then needs to understand how to solve the problem.
- Solve the problem.
- Loki - log database.
- Alloy/Promtail - log ingestion and label definition.
- OpenShift AI - model serving, data science pipelines, notebooks.
- Backend:
- FastAPI - for API endpoints.
- Langchain.
- LangGraph - for building the agentic workflow.
- PostgreSQL.
- Sentence Transformers - generating embeddings.
- UI:
- Gradio (for now)
- Annotation interface: an interface that is used for evaluation and workflow improvement
- Gradio
- Data is being ingested from the Red Hat Ansible Automation Platform (AAP) clusters, using Alloy or Promtail, into Loki (a time series database designed for logs).
- An error log triggers a Grafana alert and is sent into the agentic workflow.
- The agentic workflow processes the log and stores the processed data into a PostgreSQL database.
- Using the UI, the log analyst interacts with the logs and receives suggestions on how to solve the error, depending on their authorization.
Many logs are generated from the same log template. To group them, we embed a subset of each log, then cluster all the embeddings into groups. Each group represents a log template. For example, let’s look at the following three logs:
1. error: user id 10 already exists.
2. error: user id 15 already exists.
3. error: password of user itayk is wrong.
As we can see here, logs 1 and 2 are from the same template, and we want to group them together.
Then the user will be able to filter by templates.
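As a lightweight illustration of the grouping idea, logs from the same template can be collapsed to one key by masking their variable fields. This is only a sketch: the production pipeline embeds a subset of each log and clusters the embeddings, and the masking rules below are assumptions chosen to fit the three example logs.

```python
import re
from collections import defaultdict

def template_key(log: str) -> str:
    """Mask variable fields (numbers, user names) so logs from the
    same template produce identical keys."""
    masked = re.sub(r"\b\d+\b", "<NUM>", log)              # user id 10 -> user id <NUM>
    masked = re.sub(r"user (\w+)", "user <USER>", masked)  # user itayk -> user <USER>
    return masked

logs = [
    "error: user id 10 already exists.",
    "error: user id 15 already exists.",
    "error: password of user itayk is wrong.",
]

# Group logs by their masked key: each key stands in for one log template.
groups = defaultdict(list)
for log in logs:
    groups[template_key(log)].append(log)

print(len(groups))  # logs 1 and 2 share a template -> 2 groups
```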
For each log template, we create a summary of the error and classify it by authorization level.
For example, an analyst with AWS authorization will filter by it and see only the relevant error summaries in the UI.
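A minimal sketch of that authorization filter. The record layout (`ErrorSummary`) is an assumption for illustration; the role strings are taken from the roles listed in the User Interface section.

```python
from dataclasses import dataclass

@dataclass
class ErrorSummary:
    template_id: int
    summary: str
    authorization: str  # the role allowed to see this summary

def visible_summaries(summaries, analyst_roles):
    """Return only the error summaries the analyst is authorized to see."""
    return [s for s in summaries if s.authorization in analyst_roles]

summaries = [
    ErrorSummary(1, "EC2 provisioning failed: quota exceeded",
                 "DevOps / CI/CD Engineers (Ansible + Automation Platform)"),
    ErrorSummary(2, "Pod stuck in CrashLoopBackOff",
                 "Kubernetes / OpenShift Cluster Admins"),
]

# A cluster admin only sees the Kubernetes-related summary.
print(visible_summaries(summaries, {"Kubernetes / OpenShift Cluster Admins"}))
```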
We will have a router that will determine if we need more context to solve the problem or if the log error alone is sufficient to generate the step-by-step solution.
If we need more context, we will spin up an agent that will accumulate context as needed by using the following:
- Loki MCP, which is able to query the log database for additional log context.
- RAG for retrieving an error cheat sheet of already solved questions.
- Ansible MCP for obtaining code source data to suggest a better solution.
Finally, we store a payload of the generated values for each log in a PostgreSQL database.
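The routing decision can be sketched as a plain function. The keyword heuristic below is purely illustrative (in the real workflow this decision is a node in the LangGraph graph), and the tool names only echo the list above.

```python
def route(error_log: str) -> str:
    """Decide whether the raw log is enough, or an agent must gather context."""
    # Illustrative heuristic: short, self-explanatory errors go straight to
    # solution generation; everything else triggers the context-gathering agent.
    self_contained = ("syntax error" in error_log.lower()
                      or "permission denied" in error_log.lower())
    return "generate_solution" if self_contained else "gather_context"

# Tools the context-gathering agent may call (names are placeholders).
CONTEXT_TOOLS = ["loki_mcp", "rag_cheat_sheet", "ansible_mcp"]

print(route("fatal: permission denied (publickey)"))     # generate_solution
print(route("EC2 instance i-0abc failed health check"))  # gather_context
```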
Currently, the only difference between the training and inference stages is the clustering algorithm:
- Training: train the clustering algorithm to cluster the logs by log template.
- Inference: load the trained clustering model.
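A sketch of that train/inference split, assuming the fitted model is serialized to disk with `pickle`. The `TemplateClusterer` class and the file name are stand-ins for illustration, not the real model.

```python
import pickle

class TemplateClusterer:
    """Toy stand-in for the real clustering model fit on log embeddings."""
    def __init__(self):
        self.templates = {}
    def fit(self, keyed_logs):
        # Remember which template keys were seen during training.
        for key, log in keyed_logs:
            self.templates.setdefault(key, []).append(log)
        return self
    def predict(self, key):
        return key if key in self.templates else "unknown"

# --- Training stage: fit on historical logs, then persist the model ---
model = TemplateClusterer().fit([("user-exists", "error: user id 10 already exists.")])
with open("clusterer.pkl", "wb") as f:
    pickle.dump(model, f)

# --- Inference stage: load the trained model instead of re-fitting ---
with open("clusterer.pkl", "rb") as f:
    loaded = pickle.load(f)
print(loaded.predict("user-exists"))  # user-exists
```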
- Each expert selects their role, depending on their authorization. The current roles are:
- Kubernetes / OpenShift Cluster Admins
- DevOps / CI/CD Engineers (Ansible + Automation Platform)
- Networking / Security Engineers
- System Administrators / OS Engineers
- Application Developers / GitOps / Platform Engineers
- Identity & Access Management (IAM) Engineers
- Other / Miscellaneous
- Each expert can filter by labels (cluster_name, log_file_name, …)
- A summary of each log is listed for the expert; clicking a summary shows the whole log, a step-by-step solution, the timestamp, and labels.
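The label filter can be sketched as a small helper; the label names mirror the examples above, and the record layout is an assumption.

```python
def filter_by_labels(logs, **wanted):
    """Keep only logs whose labels match every requested key/value pair."""
    return [log for log in logs
            if all(log["labels"].get(k) == v for k, v in wanted.items())]

logs = [
    {"summary": "AWS quota exceeded",
     "labels": {"cluster_name": "prod", "log_file_name": "job_42.txt"}},
    {"summary": "Pod CrashLoopBackOff",
     "labels": {"cluster_name": "dev", "log_file_name": "job_43.txt"}},
]

# Only the log from the "prod" cluster survives the filter.
print(filter_by_labels(logs, cluster_name="prod"))
```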
After selecting the expert's authorization class:
To improve our agentic workflow, context PDFs, and other context, we need to understand the errors. To do so, we provide a data annotation interface for annotating Ansible error log pipeline outputs, where we see the agentic workflow:
- Input on the left (error log)
- Outputs in the center (summary and step-by-step solution)
- Annotation window on the right
See the interface below:
- OpenShift Cluster
- Helm
- oc CLI (for OpenShift)
The Ansible Log Monitor can be deployed in multiple environments depending on your needs. Choose the deployment method that best fits your requirements:
To add data during development, place your log files in the `data/logs/failed` directory. Each log should be saved as a separate `.txt` file (e.g., `<filename>.txt`), for example `data/logs/failed/example.txt`.
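Assuming you are at the repository root, a mock failed log can be added like this (the file name and log contents are placeholders):

```shell
# Create the directory expected by the pipeline and add one mock failed log.
mkdir -p data/logs/failed
cat > data/logs/failed/example.txt <<'EOF'
fatal: [host1]: FAILED! => {"msg": "error: user id 10 already exists."}
EOF
```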
For development and testing, you can run all services locally using the provided Makefile:
- Docker and Docker Compose
- `uv` package manager with Python 3.12+
- Make (for running deployment commands)
- Make sure you have added the mock data as described in the Mock Data (Temporary for Development) section.
Follow these steps to set up and run the Ansible Log Monitor on your local development environment:
1. Clone and Setup Repository
# Clone the repository
git clone <repository-url>
cd ansible-logs
# Install Python dependencies using uv
uv sync
2. Configure Environment Variables
# Copy the environment template and configure your settings
cp .env.example .env
# Edit .env with your API keys and configuration:
# - OPENAI_API_ENDPOINT: vLLM (OpenAI-compatible) endpoint (some endpoints need /v1 as a suffix)
# - OPENAI_API_TOKEN: your token for the endpoint
# - OPENAI_MODEL: Model to use (e.g., Granite-3.3-8B-Instruct)
# - LANGSMITH_API_KEY: Optional, for LangSmith tracing
3. Start All Services
In short:
make local/start
make local/run-whole-training-pipeline
# Launch all services in the background
make local/start
# Run the complete training pipeline (do it after local/start)
make local/run-whole-training-pipeline
# Perform status check to see which services are running
make local/status
# Stop all services when done
make local/stop
Additional Commands
# Restart all services
make local/restart
# View all available local commands
make local/help
TODO