Project Coconut is a professional, modular AI ecosystem designed for high-performance production deployments. It bridges the gap between raw models and scalable services with its unique 5-layer architecture.
- Hardware Auto-Sensing: Automatically detects NVIDIA GPUs and applies FP16 precision. Falls back to optimized FP32 on CPUs.
- Semantic Inference Cache: Up to 99% latency reduction on repeat queries. Responds in <1s by serving cached responses to semantically similar prompts instead of re-running the model.
- Dynamic Feature Toggles: Enable/Disable RAG, Semantic Caching, and Session Memory instantly via environment variables.
- Auto-MLOps (CI/CD): Integrated GitHub Actions with Trivy security scanning and automated Docker Hub deployment.
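The semantic cache works by comparing an embedding of the incoming query against embeddings of previously answered queries. The sketch below is illustrative only, with toy 3-d vectors standing in for a real embedding model; the function names and the sample cached answer are assumptions, not the project's code.

```python
# Hedged sketch of a semantic cache lookup: embed the query, compare it to
# cached query embeddings, and reuse the stored answer above a similarity
# threshold. Toy 3-d "embeddings" stand in for a real embedding model.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

CACHE_THRESHOLD = 0.85
cache = [([0.9, 0.1, 0.0], "Coconut is a modular AI ecosystem.")]

def lookup(query_vec):
    for vec, answer in cache:
        if cosine(query_vec, vec) >= CACHE_THRESHOLD:
            return answer  # cache hit: skip model inference entirely
    return None            # miss: fall through to the model

print(lookup([0.88, 0.12, 0.01]))  # near-duplicate query -> cached answer
print(lookup([0.0, 0.0, 1.0]))     # unrelated query -> None
```

A cache hit skips the model entirely, which is where the sub-second repeat-query latency comes from.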
Follow this exact sequence to go from a clean server to a production-ready AI backend.
Option A: Local Build (Developer Mode)
```bash
git clone https://github.com/Omdeepb69/Cococnut-container.git
cd Cococnut-container
docker compose up --build -d
```

Option B: Docker Cloud (Production Mode)

```bash
# Pull the pre-built S-Tier image
docker pull omdeep22/coconut_can:latest
```

The API is locked by default. Generate your first production key:
```bash
# This creates a 'pro' tier key with a 100 req/min limit
curl -X POST "http://localhost:8000/generate-key?tier=pro"
```

Important
Save the returned api_key immediately. It is hashed for security and cannot be shown again.
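The reason the key cannot be shown again: a common pattern (sketched below under that assumption; the function names and `coco_` prefix are illustrative, not the project's actual code) is to store only a digest of the key and compare digests on each request.

```python
# Hedged sketch: store only a hash of the API key; verify by re-hashing.
# Names and the key prefix are illustrative assumptions.
import hashlib
import secrets

def issue_key():
    raw = "coco_" + secrets.token_urlsafe(24)          # shown to the user once
    digest = hashlib.sha256(raw.encode()).hexdigest()  # what the server stores
    return raw, digest

def verify(raw, stored_digest):
    return hashlib.sha256(raw.encode()).hexdigest() == stored_digest

raw, digest = issue_key()
print(verify(raw, digest))   # the real key verifies
print(verify("bad", digest)) # anything else does not
```

Because only the digest is persisted, a lost key can only be replaced, never recovered.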
```bash
# Replace YOUR_KEY with the key from Step 2
curl -X POST "http://localhost:8000/chat" \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is Project Coconut?", "session_id": "init_test"}'
```

Project Coconut's S-Tier Engine is built on the industry-standard AutoModelForCausalLM and AutoTokenizer classes from Hugging Face Transformers. This means you can swap the "Brain" for almost any model on Hugging Face simply by changing MODEL_ID.
- Mistral / Mixtral: `mistralai/Mistral-7B-Instruct-v0.2`
- Llama 3 / 2: `meta-llama/Meta-Llama-3-8B-Instruct`
- Gemma: `google/gemma-7b-it`
- Falcon: `tiiuae/falcon-7b-instruct`
- GPT-2 / Neo: `gpt2` (great for low-resource testing)
The harness uses apply_chat_template, which automatically detects and applies the correct prompt format (System/User/Assistant) for whichever model you choose. No manual prompt engineering required!
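Conceptually, a chat template renders a role-tagged message list into the exact prompt string a given model was trained on. The stand-in below illustrates the idea with generic tags; it is NOT any model's real template (use the tokenizer's own `apply_chat_template` in practice).

```python
# Illustrative stand-in for what apply_chat_template does: render a
# message list into a role-tagged prompt. The <|role|> tags are generic
# placeholders, not the template of any specific model.
def render_chat(messages, add_generation_prompt=True):
    parts = []
    for msg in messages:
        parts.append(f"<|{msg['role']}|>\n{msg['content']}")
    if add_generation_prompt:
        parts.append("<|assistant|>\n")  # cue the model to respond
    return "\n".join(parts)

prompt = render_chat([
    {"role": "system", "content": "You are Project Coconut."},
    {"role": "user", "content": "What can you do?"},
])
print(prompt)
```

The real method does the same job per model: Mistral gets `[INST]` markers, Llama 3 gets its header tokens, and so on, without you writing any of it by hand.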
Tip
Resource Planning: Large models (7B+ parameters) require significant VRAM/RAM. Ensure your server has 16GB+ RAM for 7B models on CPU, or 24GB+ VRAM for GPU deployment.
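The guidance above can be sanity-checked with a back-of-envelope estimate: weight memory is roughly parameter count times bytes per parameter. Treat the results as lower bounds, since the KV cache, activations, and runtime overhead come on top.

```python
# Back-of-envelope weight memory: parameter count x bytes per parameter.
# Real usage is higher (KV cache, activations, runtime overhead), so treat
# these figures as lower bounds when sizing a server.
def weight_gib(n_params, bytes_per_param):
    return n_params * bytes_per_param / 1024**3

SEVEN_B = 7_000_000_000
print(f"FP32 (CPU fallback): {weight_gib(SEVEN_B, 4):.1f} GiB")
print(f"FP16 (GPU):          {weight_gib(SEVEN_B, 2):.1f} GiB")
print(f"4-bit (quantized):   {weight_gib(SEVEN_B, 0.5):.1f} GiB")
```

This is why a 7B model wants 16GB+ RAM in FP32 on CPU, fits a 24GB GPU comfortably in FP16, and squeezes onto consumer cards once 4-bit quantization is enabled.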
If you are using the image omdeep22/coconut_can without the full repository, use these commands to control the engine.
```bash
# Start a Redis dependency first
docker run -d --name redis-brain -p 6380:6379 redis/redis-stack:latest

# Launch the Coconut Can
docker run -d \
  --name coconut-engine \
  -p 8000:8000 \
  -e REDIS_HOST=host.docker.internal \
  -e REDIS_PORT=6380 \
  omdeep22/coconut_can:latest

# Run ingestion inside the active container
docker exec -it coconut-engine python3 ingest.py "The secret verification code is COCO-99."
```

For GPU acceleration:

```bash
docker run -d \
  --name coconut-gpu \
  --gpus all \
  -p 8000:8000 \
  -e DEVICE=cuda \
  omdeep22/coconut_can:latest
```

| Task | Command |
|---|---|
| Follow Live AI Logic | `docker logs -f coconut-engine` |
| Check Resource Usage | `docker stats coconut-engine` |
| Enter Shell (Debug) | `docker exec -it coconut-engine bash` |
| Inspect Env Config | `docker inspect coconut-engine` |
| Reset Knowledge | `docker exec redis-brain redis-cli FT.DROPINDEX coconut_idx` |
| Clear Logic Cache | `docker exec redis-brain redis-cli FT.DROPINDEX coconut_cache_idx` |
| Debug Auth Keys | `docker exec redis-brain redis-cli KEYS "api_key:*"` |
| Clean Deep Exit | `docker system prune -a --volumes` |
Project Coconut is designed to be modified without changing code. You can inject these as environment variables (-e).
To swap the "Brain", set the MODEL_ID:
```bash
docker run -e MODEL_ID=gpt2 omdeep22/coconut_can:latest
```

- `ENABLE_RAG=False`: Disables knowledge lookup.
- `ENABLE_CACHE=False`: Disables semantic reuse (forces fresh AI generation every time).
- `ENABLE_MEMORY=False`: Disables session history.
- `CACHE_THRESHOLD=0.90`: Makes the semantic cache harder to hit (for more precise matches).
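How such env-driven toggles are typically consumed inside the service can be sketched as follows; the variable names match this README, but the parsing helper is an assumption, not the project's actual code.

```python
# Hedged sketch: reading boolean toggles and a float threshold from env
# vars. Variable names match the README; the helper is an assumption.
import os

def env_flag(name, default=True):
    # Treat "False"/"0"/"no" (any case) as off; anything else as on.
    return os.environ.get(name, str(default)).strip().lower() not in ("false", "0", "no")

os.environ["ENABLE_CACHE"] = "False"  # simulate `docker run -e ENABLE_CACHE=False`

ENABLE_RAG = env_flag("ENABLE_RAG")       # unset -> defaults to True
ENABLE_CACHE = env_flag("ENABLE_CACHE")   # "False" -> off
CACHE_THRESHOLD = float(os.environ.get("CACHE_THRESHOLD", "0.85"))

print(ENABLE_RAG, ENABLE_CACHE, CACHE_THRESHOLD)
```

The point of this pattern is that behavior changes at `docker run` time, with no code edits or rebuilds.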
Run 7B+ models on consumer GPUs (T4, 3060) by enabling quantization:
```bash
docker run -e LOAD_IN_4BIT=True omdeep22/coconut_can
```

Use the new streaming endpoint for a high-end, typing-effect UI:
```bash
curl -X POST "http://localhost:8000/chat/stream" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello Gonyai!"}'
```

For hyper-scale deployments, use the provided Kubernetes manifests to orchestrate a resilient cluster.
Apply all manifests in the k8s/ directory to launch the API, Redis cluster, and Auto-scaler:
```bash
kubectl apply -f k8s/
```

If traffic spikes beyond the HPA's reaction time, scale manually:

```bash
kubectl scale deployment coconut-api --replicas=50
```

- Pods Status: `kubectl get pods -l app=coconut`
- Auto-scaling events: `kubectl get hpa coconut-hpa`
- Service URL: `kubectl get svc coconut-service`
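For intuition on when the HPA will scale you before a manual override is needed: Kubernetes sizes replicas with a documented ratio formula, sketched here (the example metric values are hypothetical).

```python
# The Kubernetes HPA computes, per its documentation:
#   desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)
import math

def hpa_desired(current_replicas, current_metric, target_metric):
    return math.ceil(current_replicas * current_metric / target_metric)

# Hypothetical: 10 pods averaging 850m CPU against a 500m target
print(hpa_desired(10, 850, 500))  # scales up to 17 replicas
```

If traffic outruns the controller's sync period, `kubectl scale` (above) jumps straight to the replica count you need.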
Use these essential commands to manage and secure your Gonyai Engine deployment.
Run the AI engine with a secure Admin Root Key and 4-bit Quantization.
```bash
docker run -d \
  -p 8000:8000 \
  -e ADMIN_ROOT_KEY="your_master_key_2026" \
  -e LOAD_IN_4BIT=True \
  -e DEVICE=cuda \
  omdeep22/coconut_can:latest
```

```bash
# 1. Apply Secrets (Edit k8s/secrets.yaml first)
kubectl apply -f k8s/secrets.yaml

# 2. Launch the Stack
kubectl apply -f k8s/
```

Since the API is secured, use your ADMIN_ROOT_KEY to generate keys for your users.
```bash
curl -X POST "http://localhost:8000/generate-key?tier=pro" \
  -H "X-API-KEY: your_master_key_2026"
```

Monitor the "Brain" and track real-time cache hits/latency.
- Readiness: `curl http://localhost:8000/health/ready`
- Metrics: `curl http://localhost:8000/metrics`
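If the `/metrics` endpoint emits Prometheus-style text (an assumption; the metric names below are invented for illustration), it can be consumed with a few lines of parsing:

```python
# Hedged sketch: parsing a Prometheus-style /metrics payload.
# The metric names here are invented for illustration.
sample = """\
# HELP coconut_cache_hits_total Semantic cache hits
coconut_cache_hits_total 42
coconut_request_latency_seconds 0.8
"""

def parse_metrics(text):
    metrics = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        name, value = line.rsplit(" ", 1)
        metrics[name] = float(value)
    return metrics

print(parse_metrics(sample))
```

In production you would point a real Prometheus scraper at the endpoint instead, but a quick parser like this is handy for smoke tests.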
```bash
curl -N -X POST "http://localhost:8000/chat/stream" \
  -H "X-API-KEY: user_api_key" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Give me a 5-step plan for global scale."}'
```