🏠 Homepage • 📝 Zhihu • 💻 GitHub
M.Eng. Student in Computer Technology @ University of Chinese Academy of Sciences (UCAS)
B.Eng. in Computer Science @ Tianjin University (Outstanding Thesis Award)
I am a final-year master's student at the University of Chinese Academy of Sciences, focusing on ML Systems, Deep Learning Compilers, and GPU Programming. Previously, I worked extensively with the Rust programming language on systems-level projects including operating systems and hypervisors.
Machine Learning Systems · Deep Learning Compilers · GPU Kernel Optimization · CUDA Programming · Operating Systems · Virtualization
WeChat — LLM Infra Team · ML System Intern · June 2025 – Present
- Implemented Light-DuoAttention in CuTeDSL for efficient long-context inference and integrated it into SGLang, where it now runs.
- Explored NVSHMEM & DeepEP; built NVSHMEM-Tutorial with hybrid CUDA IPC / RDMA communication for internal technical sharing.
- Implemented Ring Attention Forward based on ThunderKittens using the LCF template, outperforming ring-flash-attention on short sequences. Implemented Flash Attention Backward based on LCF.
- Analyzed the performance of MagiAttention, ZigZag Ring Attention, and ZigZag Flex Attention.
- Investigated DSL design on NVIDIA Hopper architecture.
Microsoft Research Asia — System & Network Group · Research Intern · Feb 2024 – May 2025
- Using the FractalTensor programming model, optimized GEMM, Back-to-Back GEMMs, Stacked/Dilated LSTM, and FlashAttention-2 with CUTLASS, achieving up to 5.45× speedup (2.14× on average) over SOTA on NVIDIA A100.
- As a core designer and developer, built TileFusion: an efficient C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
- Mentored by Dr. Ying Cao. Co-first authored a paper published at SOSP'24.
Tsinghua University — OS Laboratory · Research Intern · May 2023 – July 2023
- Wrote an Intel 82599 NIC driver in Rust (referencing DPDK for optimization) and integrated it into ArceOS. Performed network performance benchmarking and optimization.
- Developed a Type-2 hypervisor based on ArceOS capable of booting Linux; built Hypercraft as a standalone VMM library.
| Project | Description |
|---|---|
| microsoft/TileFusion | C++ macro kernel template library for tile processing across GPU memory hierarchy with TensorCore support |
| microsoft/FractalTensor | Programming framework for organizing DNN data as nested statically-shaped tensors with automatic compiler analysis |
| NVSHMEM-Tutorial | Build a DeepEP-like GPU communication buffer with NVSHMEM (hybrid CUDA IPC / RDMA) |
| xv6-rust | Reimplementation of MIT xv6-riscv in Rust; reference implementation for OSCOMP |
| arceos | Experimental modular OS in Rust; contributed hypervisor, ixgbe NIC driver, and network optimization |
| Hypercraft | VMM library in Rust for RISC-V / AArch64 virtualization, capable of booting Linux |
| hypocaust-2 | Hardware-assisted RISC-V hypervisor using H Extension; boots rCore, RT-Thread, and Linux |
- Uncovering Nested Data Parallelism and Data Reuse in DNN Computation with FractalTensor. Siran Liu*, Chengxiang Qi*, Ying Cao, Chao Yang, Weifang Hu, Xuanhua Shi, Fan Yang, Mao Yang. ACM SIGOPS 30th Symposium on Operating Systems Principles (SOSP'24) · (*equal contribution) [Paper] [Code]
- Design and Implementation of a RISC-V-Based Type-1 Hypervisor (基于 RISC-V 的 Type-1 Hypervisor 的设计与实现). Chengxiang Qi. Bachelor Thesis, Tianjin University · (Outstanding Thesis Award) [Code]
- Hypocaust: a RISC-V Type-1 Hypervisor Written in Rust · OS2ATC 2022, Beijing (March 2023). Presentation on the design and implementation of a RISC-V Type-1 hypervisor, showcasing virtualization techniques and system-level Rust programming.




