University of Bristol · Bristol


Hi there, I'm Dexter Ding (Jiahua Ding) 👋

AI Infra Engineer focused on Large Language Model (LLM) pre- and post-training and inference optimization. Currently developing expertise across the full training tech stack and in GPU kernel optimization (CUDA/Triton). I combine solid ML theory, PyTorch engineering, and system-level optimization to build scalable, high-efficiency AI solutions.

"Building a complete-stack understanding, from model math to the GPU execution pipeline, and from training to inference"

Languages and Tools:

| Domain | Skills / Tools |
| --- | --- |
| LLM Training | PyTorch, LLaMA-Factory, GQA, SFT, GRPO |
| Inference Optimization | vLLM, CUDA, Tensor Cores, Nsight Systems |
| GPU & Systems | CUDA C++, PTX profiling, memory-hierarchy tuning, cuBLAS |
| Efficient Computing | Pruning, Quantization, Knowledge Distillation, Kernel Fusion |


⚡ GitHub Stats

Pinned

  1. Build-a-Large-Language-Model-From-Scratch- (Public)

    Forked from stanford-cs336/assignment1-basics

    A student's from-scratch implementation of Stanford CS336 assignment 1 (Language Modeling From Scratch), with study notes. Builds everything from a BPE tokenizer up through each module of the Transformer; uses a map-reduce approach to reduce OOM triggers during tokenizer training; paired with a hand-written single-machine training-job system and OmegaConf-driven ablation experiments.

    Python
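The heart of a from-scratch BPE tokenizer like the one in this assignment is a greedy loop: count adjacent symbol pairs across the corpus, then merge the most frequent pair into a new symbol. A minimal sketch of one training step (function names and the toy corpus are illustrative, not from the repo):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a frequency-weighted corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # fuse the pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# one BPE training step on a toy corpus of pre-tokenized words
words = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "o", "w"): 3}
pair = most_frequent_pair(words)   # ("o", "w") occurs 10 times
words = merge_pair(words, pair)
```

A real implementation repeats this step until the target vocabulary size is reached; the sequence of learned merges then doubles as the encoding rule at inference time.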

  2. nano_vllm_v2 (Public)

    A from-scratch, high-performance LLM inference engine for Qwen3-8B (8B parameters, GQA architecture), focused on reproducing the core optimization techniques of vLLM from first principles; reaches 1100+ tok/s total throughput on a single RTX 3090.

    Python
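The central vLLM idea such an engine reproduces is PagedAttention: the KV cache is stored in fixed-size physical blocks, and a per-sequence block table maps logical token positions to those blocks, so memory is allocated on demand instead of being reserved for the maximum sequence length. A toy sketch of the bookkeeping (class and method names are illustrative, not from the repo):

```python
BLOCK_SIZE = 16  # tokens per KV-cache block

class BlockTable:
    """Maps a sequence's logical token positions to physical cache blocks."""
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks  # shared pool of physical block ids
        self.blocks = []                # this sequence's blocks, in order
        self.num_tokens = 0

    def append_token(self):
        """Reserve cache space for one more token, allocating a new block
        only when the current one is full."""
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(self.free_blocks.pop())
        self.num_tokens += 1

    def physical_slot(self, pos):
        """Translate a logical token position into (block_id, offset)."""
        return self.blocks[pos // BLOCK_SIZE], pos % BLOCK_SIZE

free_blocks = list(range(64))   # toy pool of physical blocks
table = BlockTable(free_blocks)
for _ in range(20):             # caching 20 tokens needs only 2 blocks
    table.append_token()
```

The attention kernel then gathers K/V through this indirection, which is what lets many sequences of very different lengths share one cache pool without fragmentation.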

  3. -CoT-SFT-EI-GRPO (Public)

    Centered on the question "how can a language model learn chain-of-thought reasoning?", implements from scratch and compares three post-training routes on Qwen2.5-Math-1.5B.

    Python
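GRPO (Group Relative Policy Optimization), one of the post-training routes compared here, replaces a learned value baseline with a group statistic: sample several completions per prompt, then normalize each completion's reward by the group's mean and standard deviation. A minimal sketch of that advantage computation (pure Python, illustrative only):

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean(group)) / (std(group) + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# rewards (e.g. answer correctness) for 4 sampled completions of one prompt
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that beat their own group get positive advantage, the rest get negative, so no separate critic network is needed; these advantages then weight the usual clipped policy-gradient objective.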

  4. advanced-hpc-lbm (Public)

    Forked from UoB-HPC/advanced-hpc-lbm

    COMS30006 - Advanced High Performance Computing - Lattice Boltzmann

    C

  5. AI-Systems-C-LLM-Infra (Public)

    AI Systems & C++ / LLM Infra study curriculum (long-term plan). This repository is for systematically learning C++ / Systems / Parallel Computing / LLM Systems / AI Infra, with the goal of being able to independently implement LLM inference and training infrastructure (Inference / Training Infra).

    Python

  6. learn_cuda-triton (Public)

    Covers working through PMPP (Programming Massively Parallel Processors), single-machine matrix-multiplication speed benchmarks, and an exploration of Tensor Cores.

    C++
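The PMPP chapters on matrix multiplication revolve around tiling: partitioning the output into blocks so each tile of A and B is loaded once into fast memory and reused. The same loop structure can be sketched in plain Python (in the CUDA version the two outer loops become the thread-block grid and the tiles live in shared memory):

```python
TILE = 2  # tile edge length; a CUDA kernel would match this to the block size

def matmul_tiled(A, B):
    """Tiled matrix multiply: accumulate C block by block, reusing tiles."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, TILE):            # tile row of C
        for j0 in range(0, m, TILE):        # tile column of C
            for p0 in range(0, k, TILE):    # walk tiles along the shared dim
                for i in range(i0, min(i0 + TILE, n)):
                    for j in range(j0, min(j0 + TILE, m)):
                        for p in range(p0, min(p0 + TILE, k)):
                            C[i][j] += A[i][p] * B[p][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul_tiled(A, B)   # [[19, 22], [43, 50]]
```

The payoff on a GPU comes from the memory hierarchy, not the loop order itself: each global-memory element is read O(n / TILE) times fewer, which is exactly what the benchmarks in a repo like this measure against cuBLAS and Tensor Core paths.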