Commit e209785

feat(pipeline): full optimization with dynamic batching, O(1) hash cache, and mandatory Tika server
Unified the Dockerfile by removing the tika variants; the Tika server is now always started at container startup, eliminating the 5-10s cold start per file. Implemented dynamic batching based on a token budget instead of a fixed batch size, reducing API calls from 334 to 50-100 per 1000 chunks. Added a file_hash index on Qdrant for O(1) queries instead of O(N) scroll, with a per-session in-memory cache.
1 parent 4a34b84 commit e209785

13 files changed: +463 -469 lines changed

CHANGELOG.md

Lines changed: 30 additions & 0 deletions
@@ -7,6 +7,36 @@ and the project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.
 
 ## [Unreleased]
 
+## [2.0.0] - 2025-12-12
+
+### Breaking Changes
+- **Single Docker image**: removed `Dockerfile.tika` and `Dockerfile.tika.local`; only `Dockerfile` remains, with Tika built in
+- **Tika always required**: removed the `--no-tika` flag and the optional-Tika logic; the Tika server is always running
+- **Simplified image tag**: use `ghcr.io/strawberry-code/ragify:latest` (the `-tika` suffix is gone)
+
+### Added
+- **Dynamic batching**: new batching system driven by a token budget instead of a fixed batch size
+- **EMBEDDING_TOKEN_BUDGET**: new env var (default 1800) that caps the tokens per batch
+- **file_hash index**: automatic index creation on Qdrant for O(1) queries instead of O(N) scroll
+- **FileHashCache**: in-memory cache that avoids repeated queries during indexing
+- **Tika server mode**: Tika starts as a server at container startup (port 9998), eliminating the 5-10s cold start per file
+
+### Changed
+- `EMBEDDING_BATCH_SIZE` default raised from 3 to 20 (now works together with the token budget)
+- Health check now also verifies the Tika server, in addition to the API, Ollama, and Qdrant
+- `check_file_hash_in_qdrant()` uses `count()` (O(1)) instead of `scroll()` (O(N))
+- Entrypoint starts Qdrant → Ollama → Tika → API in sequence
+
+### Removed
+- `Dockerfile.tika` and `Dockerfile.tika.local`
+- `--no-tika` and `--non-interactive` CLI flags
+- Conditional `use_tika` logic in the pipeline and API
+
+### Performance
+- Fewer embedding API calls: ~334 → ~50-100 per 1000 chunks
+- Tika cold start eliminated: 5-10s → 0s per file
+- O(1) hash check via index instead of O(N) scroll
+
 ## [1.3.2] - 2025-12-04
 
 ### Fixed
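
The call-count figures in the changelog follow from the batch limits: a fixed batch size of 3 needs ceil(1000 / 3) = 334 calls for 1000 chunks, while a cap of 20 chunks per batch gives a floor of 1000 / 20 = 50 calls, with the token budget pushing the count higher when chunks are long. The commit's batching code is not shown on this page, so the following is only a minimal sketch of token-budget batching. The names `estimate_tokens` and `dynamic_batches`, the 4-characters-per-token estimate, and the way the two limits are combined are all assumptions; only the defaults 1800 and 20 come from the changelog.

```python
from typing import Callable, Iterable, Iterator


def estimate_tokens(text: str) -> int:
    """Crude token estimate (assumption: roughly 4 characters per token)."""
    return max(1, len(text) // 4)


def dynamic_batches(
    chunks: Iterable[str],
    token_budget: int = 1800,   # changelog default for EMBEDDING_TOKEN_BUDGET
    max_batch_size: int = 20,   # changelog default for EMBEDDING_BATCH_SIZE
    count_tokens: Callable[[str], int] = estimate_tokens,
) -> Iterator[list[str]]:
    """Group chunks so each batch stays within both the token and size limits."""
    batch: list[str] = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        # Flush the current batch if this chunk would break either limit.
        # A single oversized chunk still gets a batch of its own.
        if batch and (used + cost > token_budget or len(batch) >= max_batch_size):
            yield batch
            batch, used = [], 0
        batch.append(chunk)
        used += cost
    if batch:
        yield batch


# Usage: 1000 chunks of about 100 estimated tokens each.
chunks = ["x" * 400] * 1000
print(len(list(dynamic_batches(chunks))))  # 56 (18 chunks fit in 1800 tokens)
```

With short chunks of about 10 estimated tokens the size cap dominates and the same 1000 chunks need exactly 50 batches, which matches the lower end of the range quoted in the changelog.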

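The `FileHashCache` entry describes a per-session in-memory cache placed in front of the Qdrant hash check. Its real interface is not part of the diff shown here, so this is a hypothetical sketch of the idea: the class shape, the method names `contains` and `mark_indexed`, and the injected `lookup` callable are assumptions. In the project, `lookup` would presumably wrap `check_file_hash_in_qdrant()`, which the changelog says issues a `count()` query.

```python
from typing import Callable


class FileHashCache:
    """Session-scoped memo of file-hash lookups (hypothetical shape)."""

    def __init__(self, lookup: Callable[[str], bool]) -> None:
        # `lookup` answers "is this hash already indexed?" against the backend.
        self._lookup = lookup
        self._known: dict[str, bool] = {}

    def contains(self, file_hash: str) -> bool:
        """Return the cached answer, querying the backend only on a miss."""
        if file_hash not in self._known:
            self._known[file_hash] = self._lookup(file_hash)
        return self._known[file_hash]

    def mark_indexed(self, file_hash: str) -> None:
        """Record a freshly indexed file so later checks skip the backend."""
        self._known[file_hash] = True


# Usage with a stand-in backend that records how often it is queried.
queries: list[str] = []


def fake_lookup(file_hash: str) -> bool:
    queries.append(file_hash)
    return file_hash == "abc123"


cache = FileHashCache(fake_lookup)
cache.contains("abc123")
cache.contains("abc123")
print(len(queries))  # 1: the second check is served from memory
```

Caching negative answers as well is a design choice in this sketch: it is safe only if nothing else writes to the collection during the session, which is why `mark_indexed` updates the entry after a file is stored.
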
Dockerfile

Lines changed: 47 additions & 9 deletions
@@ -1,4 +1,5 @@
 # Ragify All-in-One Container
+# Includes Ollama, Qdrant, Apache Tika, and Python API
 # Multi-stage build for Docker/Podman
 
 # ============================================
@@ -26,16 +27,24 @@ RUN pip install --no-cache-dir --upgrade pip && \
 FROM python:3.12-slim
 
 LABEL maintainer="Ragify"
-LABEL description="All-in-one RAG documentation search with Ollama, Qdrant, and MCP"
-LABEL version="1.0.0"
+LABEL description="All-in-one RAG documentation search with Ollama, Qdrant, MCP, and Apache Tika"
+LABEL version="2.0.0"
 
-# Install runtime dependencies
+# Install runtime dependencies including Java for Tika server
 RUN apt-get update && apt-get install -y --no-install-recommends \
     tini \
     curl \
     ca-certificates \
+    openjdk-21-jre-headless \
     && rm -rf /var/lib/apt/lists/*
 
+# Set Java environment (create symlink to avoid architecture-specific path)
+RUN ln -s /usr/lib/jvm/java-21-openjdk-* /usr/lib/jvm/java-21
+ENV JAVA_HOME=/usr/lib/jvm/java-21
+ENV PATH="${JAVA_HOME}/bin:${PATH}"
+# Tika JAR path (used by entrypoint.sh to start server)
+ENV TIKA_JAR_PATH=/tmp/tika-server.jar
+
 # Install Ollama - pinned to v0.11.0 to avoid embedding bugs in 0.12.x/0.13.x
 # See: https://github.com/ollama/ollama/issues/13054
 ENV OLLAMA_VERSION=0.11.0
@@ -78,13 +87,40 @@ COPY docker/ ./docker/
 # Make scripts executable
 RUN chmod +x /app/docker/*.sh
 
-# Pre-pull default Ollama model (makes container larger but faster startup)
-# This runs ollama serve temporarily to pull the model
+# Pre-pull default Ollama model
 RUN ollama serve & \
     sleep 5 && \
     ollama pull nomic-embed-text && \
     pkill ollama || true
 
+# Pre-download Tika JAR and ensure it's in the expected location
+RUN python3 <<'PYEOF'
+import os
+import glob
+from tika import parser
+
+# Trigger download (this downloads JAR to /tmp/)
+parser.from_buffer(b'test', xmlContent=False)
+
+# Find the JAR file (handles versioned names like tika-server-standard-3.1.0.jar)
+expected_path = '/tmp/tika-server.jar'
+if os.path.exists(expected_path):
+    print(f'Tika JAR found at {expected_path}')
+else:
+    # Search for versioned JAR
+    jars = glob.glob('/tmp/tika-server*.jar')
+    if jars:
+        print(f'Found JAR: {jars[0]}')
+        os.symlink(jars[0], expected_path)
+        print(f'Created symlink: {expected_path} -> {jars[0]}')
+    else:
+        raise Exception('No Tika JAR found in /tmp/')
+
+# Final verification
+assert os.path.exists(expected_path), f'Tika JAR not found at {expected_path}'
+print(f'Tika JAR verified at {expected_path}')
+PYEOF
+
 # Create data directory for Qdrant
 RUN mkdir -p /data/qdrant
 
@@ -100,7 +136,9 @@ ENV PYTHONUNBUFFERED=1 \
     # Qdrant (internal)
     QDRANT_URL=http://localhost:6333 \
     QDRANT_PATH=/data/qdrant \
-    # Auth (must be provided)
+    # Tika server (internal - started by entrypoint.sh)
+    TIKA_SERVER_ENDPOINT=http://localhost:9998 \
+    # Auth (must be provided for production)
     AUTH_CONFIG="" \
     GITHUB_CLIENT_ID="" \
     GITHUB_CLIENT_SECRET="" \
@@ -112,12 +150,12 @@ VOLUME ["/data"]
 # Expose ports
 EXPOSE 8080 6666
 
-# Health check
-HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
+# Health check (verifies API, Ollama, Qdrant, Tika)
+HEALTHCHECK --interval=30s --timeout=10s --start-period=90s --retries=3 \
     CMD /app/docker/healthcheck.sh
 
 # Use tini as init
 ENTRYPOINT ["/usr/bin/tini", "--"]
 
-# Run entrypoint script
+# Run entrypoint script (starts Qdrant, Ollama, Tika, API)
 CMD ["/app/docker/entrypoint.sh"]
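
The health check now has to confirm four services instead of three. The actual `healthcheck.sh` is not included in the diff above, so the following Python sketch only illustrates the idea of probing each service and reporting the ones that fail. Every URL path in `SERVICES` is an assumption: ports 6333 and 9998 appear in the Dockerfile and 8080 is one of the exposed ports, but the API health path, the Ollama port (its usual default), and the specific endpoints are guesses. The function names are hypothetical.

```python
import http.client
import urllib.error
import urllib.request
from typing import Callable, Mapping

# Assumed probe URLs; adjust to whatever the real services expose.
SERVICES: dict[str, str] = {
    "api": "http://localhost:8080/health",     # hypothetical path
    "ollama": "http://localhost:11434/",       # assumed default Ollama port
    "qdrant": "http://localhost:6333/healthz", # assumed endpoint
    "tika": "http://localhost:9998/tika",      # assumed endpoint
}


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET on the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


def failing_services(
    services: Mapping[str, str],
    probe: Callable[[str], bool] = http_ok,
) -> list[str]:
    """Names of the services whose probe did not succeed, in declaration order."""
    return [name for name, url in services.items() if not probe(url)]


if __name__ == "__main__":
    failed = failing_services(SERVICES)
    if failed:
        raise SystemExit(f"unhealthy: {', '.join(failed)}")
    print("all services healthy")
```

Passing the probe as a parameter keeps the check testable without a running container, and a non-zero exit status is what a Docker `HEALTHCHECK` command needs to mark the container unhealthy.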

Dockerfile.tika

Lines changed: 0 additions & 160 deletions
This file was deleted.
