Chengxiang Qi KuangjuX

Chengxiang Qi (齐呈祥)

M.Eng. Student in Computer Technology @ University of Chinese Academy of Sciences (UCAS)
B.Eng. in Computer Science @ Tianjin University (Outstanding Thesis Award)

About Me

I am a final-year master's student at the University of Chinese Academy of Sciences, focusing on ML Systems, Deep Learning Compilers, and GPU Programming. Previously, I worked extensively with the Rust programming language on systems-level projects including operating systems and hypervisors.

Research Interests

Machine Learning Systems · Deep Learning Compilers · GPU Kernel Optimization · CUDA Programming · Operating Systems · Virtualization

Experience

WeChat — LLM Infra Team · ML System Intern · June 2025 – Present

Implemented Light-DuoAttention using CuTeDSL for efficient long-context inference, integrated and running within SGLang.
Explored NVSHMEM & DeepEP; built NVSHMEM-Tutorial with hybrid CUDA IPC / RDMA communication for internal technical sharing.
Implemented the Ring Attention forward pass on ThunderKittens using the LCF template, outperforming ring-flash-attention on short sequences; also implemented the Flash Attention backward pass on the LCF template.
Performed performance analysis on MagiAttention, ZigZag Ring Attention, and ZigZag Flex Attention.
Investigated DSL design on NVIDIA Hopper architecture.

Microsoft Research Asia — System & Network Group · Research Intern · Feb 2024 – May 2025

Building on the FractalTensor programming model, optimized GEMM, Back-to-Back GEMMs, Stacked/Dilated LSTM, and FlashAttention-2 using CUTLASS. Achieved up to a 5.45× speedup (2.14× on average) over state-of-the-art baselines on NVIDIA A100.
As a core designer and developer, built TileFusion: an efficient C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
Mentored by Dr. Ying Cao. Co-first authored a paper published at SOSP'24.

Tsinghua University — OS Laboratory · Research Intern · May 2023 – July 2023

Wrote an Intel 82599 NIC driver in Rust (referencing DPDK for optimization) and integrated it into ArceOS. Performed network performance benchmarking and optimization.
Developed a Type-2 hypervisor based on ArceOS capable of booting Linux; built Hypercraft as a standalone VMM library.

Selected Projects

Project	Description
microsoft/TileFusion	C++ macro kernel template library for tile processing across GPU memory hierarchy with TensorCore support
microsoft/FractalTensor	Programming framework for organizing DNN data as nested statically-shaped tensors with automatic compiler analysis
NVSHMEM-Tutorial	Build a DeepEP-like GPU communication buffer with NVSHMEM (hybrid CUDA IPC / RDMA)
xv6-rust	Reimplementation of MIT xv6-riscv in Rust; reference implementation for OSCOMP
arceos	Experimental modular OS in Rust — contributed hypervisor, ixgbe NIC driver, and network optimization
Hypercraft	VMM library in Rust for RISC-V / AArch64 virtualization, capable of booting Linux
hypocaust-2	Hardware-assisted RISC-V hypervisor using H Extension; boots rCore, RT-Thread, and Linux

Publications

Uncovering Nested Data Parallelism and Data Reuse in DNN Computation with FractalTensor. Siran Liu*, Chengxiang Qi*, Ying Cao, Chao Yang, Weifang Hu, Xuanhua Shi, Fan Yang, Mao Yang. ACM SIGOPS 30th Symposium on Operating Systems Principles (SOSP'24) · (*equal contribution) [Paper] [Code]
Design and Implementation of a RISC-V-Based Type-1 Hypervisor (基于 RISC-V 的 Type-1 Hypervisor 的设计与实现). Chengxiang Qi. Bachelor Thesis, Tianjin University · (Outstanding Thesis Award) [Code]

Talks

Hypocaust: a RISC-V Type-1 Hypervisor Written in Rust · OS2ATC 2022, Beijing (March 2023)
Presentation on the design and implementation of a RISC-V Type-1 hypervisor, showcasing virtualization techniques and system-level Rust programming.
