
# Anvit More — ML Systems · GPU Kernels · Applied RL

ML engineer focused on the systems layer: how models run fast, how decisions get made under uncertainty, how inference holds up in production.


## What I build

**GPU kernel engineering** — custom Triton kernels for LLM primitives, benchmarked against PyTorch baselines on real hardware.

| Kernel | Speedup | Peak throughput |
| --- | --- | --- |
| Fused Bias + GELU | 14.65× | 172 GB/s |
| FlashAttention (T=2048) | 2.52× | 11.4 GB/s |
| Fused AdamW (50M params) | 3.45× | 177 GB/s |
| Inference attention (B=2) | 3.94× | 95 GB/s |

**triton-llm-kernels** — RMSNorm, LayerNorm, FlashAttention, fused AdamW, and inference attention. Every kernel is validated against an fp32 reference and benchmarked with `triton.testing.do_bench`.
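The validation step can be sketched with a NumPy fp32 reference for the fused Bias + GELU primitive. This is an illustrative sketch, not the repo's actual test code: the function name and tolerances here are assumptions, and the GELU tanh approximation is the common GPT-style formulation.

```python
import numpy as np

def bias_gelu_reference(x: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """fp32 reference: add bias, then apply the tanh-approximation GELU."""
    y = (x + bias).astype(np.float32)
    return 0.5 * y * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (y + 0.044715 * y**3)))

# A fused kernel's output would be compared against this reference with a tolerance:
x = np.random.default_rng(0).standard_normal((4, 8), dtype=np.float32)
bias = np.linspace(-1.0, 1.0, 8, dtype=np.float32)
ref = bias_gelu_reference(x, bias)
# e.g. np.testing.assert_allclose(kernel_out, ref, rtol=1e-3, atol=1e-3)
```

Fusing the bias add into the GELU kernel avoids an extra round trip through global memory, which is why an elementwise op like this can see large speedups.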


**Production LLM inference** — async serving stack on a 6 GB GPU, built from first principles.

| Metric | Value |
| --- | --- |
| TTFT P50 | 28 ms |
| Decode speed | 39.4 tok/s (~85% of memory bandwidth) |
| Cache-hit latency P50 | 2 ms |
| Cache hit rate | 81% |
| Success rate @ concurrency=10 | 100% |

**llm-inference-serving** — FastAPI gateway → Redis cache → FP16 PyTorch → RTX 4050L. Fused Triton attention kernel, asyncio-locked GPU access, fire-and-forget cache writes.


**Reinforcement learning for real-time decisions** — physics-informed simulation + PPO agent for F1 race strategy. The same architecture applies to ADAS planning, EV energy management, and hybrid powertrain arbitration.

| Agent | E[Position] | E[Points] |
| --- | --- | --- |
| Rule-based baseline (1-stop M→H) | 3.09 | 15.8 |
| PPO agent | 1.00 | 25.0 |

+58% expected points vs. the baseline. The Monte Carlo planner runs at 870 rollouts/second on a single CPU core.

**autonomous-strategy-engine** — physics-informed tyre/fuel/weather models, 10k–100k MC rollouts, PPO on an 8-dim sensor observation, 27 passing tests.
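The Monte Carlo planning loop, simulating many stochastic rollouts per candidate strategy and picking the best expected outcome, can be sketched like this. The tyre-degradation and pit-loss numbers are illustrative placeholders, not the engine's calibrated physics model.

```python
import random

def rollout(pit_lap: int, laps: int = 50, base: float = 90.0) -> float:
    """One stochastic race rollout: total time for a 1-stop strategy (seconds)."""
    total, tyre_age = 0.0, 0
    for lap in range(laps):
        if lap == pit_lap:
            total += 22.0                   # pit-stop time loss (illustrative)
            tyre_age = 0
        # linear tyre degradation plus lap-time noise (illustrative model)
        total += base + 0.08 * tyre_age + random.gauss(0.0, 0.3)
        tyre_age += 1
    return total

def plan(candidate_pit_laps: list[int], n_rollouts: int = 500) -> int:
    """Pick the pit lap with the lowest mean simulated race time."""
    def expected_time(pit_lap: int) -> float:
        return sum(rollout(pit_lap) for _ in range(n_rollouts)) / n_rollouts
    return min(candidate_pit_laps, key=expected_time)

random.seed(0)
best = plan([10, 20, 25, 30, 40])
print(best)
```

Averaging over rollouts is what makes the decision robust to noise: each rollout is cheap, so the planner can afford hundreds of samples per candidate within a real-time budget.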


## Stack

Python · Triton · PyTorch · CUDA · FastAPI · Redis · Docker · Stable-Baselines3 · NumPy · scikit-learn

Production experience: LoRA/QLoRA fine-tuning · Whisper ASR · RAG (FAISS, Pinecone) · Gemini Vision · medical NLP


## Background

MSc Data Science — University of Edinburgh (2024)
Currently: ML & AI Engineer @ Plus91 Technology, Pune
Target: ML Systems / LLM Inference / Automotive AI — open to relocate to Germany, Switzerland, Poland, Norway, Finland

📧 anviit22@gmail.com · LinkedIn

## Pinned repositories

1. **autonomous-strategy-engine** (Python) — F1 race strategy AI: PPO agent beats a rule-based baseline by 58% using physics simulation and Monte Carlo planning.

2. **llm-inference-serving** (Python) — Production LLM inference stack: 28 ms TTFT, 39 tok/s, 81% cache hit rate on a 6 GB GPU.

3. **triton-llm-kernels** (Python) — LLM primitives rebuilt in Triton: FlashAttention 2.52×, fused AdamW 3.45×, Bias+GELU 14.65× faster than PyTorch.

4. **phi3-triton-decode** (Python) — Custom Triton decode attention kernel integrated end-to-end into Phi-3 Mini. Preallocated KV cache delivers a 2.5× speedup over the HF baseline; the profiler shows attention is 0.65% of CUDA time. RTX 4060 …

5. **triton-int8-inference** (Python) — Triton causal attention kernel with INT8 quantization: 17.23× faster than PyTorch on an RTX 4060 (6 GB VRAM).

6. **attractor-sim** (TypeScript) — Interactive 3D strange attractor simulator: 10 chaotic systems (Lorenz, Rössler, Aizawa, Chen, Halvorsen, Dadras, Sprott B, Newton-Leipnik, Rabinovich-Fabrikant, Cygnus X-1) with RK4 integration, …