I'm an MLOps / Infrastructure engineer focused on LLM inference and GPU serving.
Currently finishing my M.S. in CS at USC. Built a two-level Kubernetes dispatcher for a heterogeneous 700-GPU cluster: session-sticky routing, gRPC state sync, and sub-GPU partitioning with HAMi and Kueue. Also worked on CUDA kernel optimization and distributed training pipelines.
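The session-sticky routing idea can be sketched with a consistent-hash ring: a session id always lands on the same pod until the pod set changes. This is a minimal illustrative toy (all names like `StickyRouter` and the pod ids are made up here, not the actual dispatcher code):

```python
import bisect
import hashlib

class StickyRouter:
    """Toy consistent-hash ring: a session id always maps to the same pod
    until the pod set changes. Illustrative only, not the real dispatcher."""

    def __init__(self, pods, vnodes=64):
        # Each pod gets several virtual nodes on the ring for smoother balance.
        self._ring = []  # sorted list of (hash, pod)
        for pod in pods:
            for v in range(vnodes):
                h = int(hashlib.md5(f"{pod}:{v}".encode()).hexdigest(), 16)
                self._ring.append((h, pod))
        self._ring.sort()

    def route(self, session_id):
        # Walk clockwise from the session's hash to the next virtual node.
        h = int(hashlib.md5(session_id.encode()).hexdigest(), 16)
        i = bisect.bisect(self._ring, (h,)) % len(self._ring)
        return self._ring[i][1]

router = StickyRouter(["gpu-pod-0", "gpu-pod-1", "gpu-pod-2"])
assert router.route("sess-42") == router.route("sess-42")  # sticky
```

In the real system the second level would consult synced state (via gRPC) before falling back to the ring, but the stickiness property is the same.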
Interested in the systems side of AI: how inference actually runs at scale, where the bottlenecks are, and how to push utilization without blowing up memory. Lately I've been digging into vLLM and SGLang internals, and I'm looking to contribute more there.
Stack · Python · C++ · CUDA · PyTorch · SGLang · Kubernetes · AWS · vLLM · JAX · GCP

