

KuangjuX/README.md

Chengxiang Qi (齐呈祥)

🏠 Homepage · 📝 Zhihu · 💻 GitHub

M.Eng. Student in Computer Technology @ University of Chinese Academy of Sciences (UCAS)
B.Eng. in Computer Science @ Tianjin University (Outstanding Thesis Award)


About Me

I am a final-year master's student at the University of Chinese Academy of Sciences, focusing on ML Systems, Deep Learning Compilers, and GPU Programming. Previously, I worked extensively with the Rust programming language on systems-level projects including operating systems and hypervisors.

Research Interests

Machine Learning Systems · Deep Learning Compilers · GPU Kernel Optimization · CUDA Programming · Operating Systems · Virtualization


Experience

WeChat — LLM Infra Team · ML System Intern · June 2025 – Present

  • Implemented Light-DuoAttention using CuTeDSL for efficient long-context inference, integrated and running within SGLang.
  • Explored NVSHMEM & DeepEP; built NVSHMEM-Tutorial with hybrid CUDA IPC / RDMA communication for internal technical sharing.
  • Implemented Ring Attention Forward based on ThunderKittens using the LCF template, outperforming ring-flash-attention on short sequences. Implemented Flash Attention Backward based on LCF.
  • Performed performance analysis on MagiAttention, ZigZag Ring Attention, and ZigZag Flex Attention.
  • Investigated DSL design on NVIDIA Hopper architecture.

Microsoft Research Asia — System & Network Group · Research Intern · Feb 2024 – May 2025

  • Based on the FractalTensor programming model, optimized GEMM, Back-to-Back GEMMs, Stacked/Dilated LSTM, and FlashAttention-2 using CUTLASS. Achieved up to a 5.45× speedup over state-of-the-art baselines on NVIDIA A100, with a 2.14× average speedup.
  • As a core designer and developer, built TileFusion: an efficient C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
  • Mentored by Dr. Ying Cao. Co-first authored a paper published at SOSP'24.

Tsinghua University — OS Laboratory · Research Intern · May 2023 – July 2023

  • Wrote an Intel 82599 NIC driver in Rust (referencing DPDK for optimization) and integrated it into ArceOS. Performed network performance benchmarking and optimization.
  • Developed a Type-2 hypervisor based on ArceOS capable of booting Linux; built Hypercraft as a standalone VMM library.

Selected Projects

| Project | Description |
| --- | --- |
| microsoft/TileFusion | C++ macro kernel template library for tile processing across GPU memory hierarchy with TensorCore support |
| microsoft/FractalTensor | Programming framework for organizing DNN data as nested statically-shaped tensors with automatic compiler analysis |
| NVSHMEM-Tutorial | Build a DeepEP-like GPU communication buffer with NVSHMEM (hybrid CUDA IPC / RDMA) |
| xv6-rust | Reimplementation of MIT xv6-riscv in Rust; reference implementation for OSCOMP |
| arceos | Experimental modular OS in Rust; contributed hypervisor, ixgbe NIC driver, and network optimization |
| Hypercraft | VMM library in Rust for RISC-V / AArch64 virtualization, capable of booting Linux |
| hypocaust-2 | Hardware-assisted RISC-V hypervisor using H Extension; boots rCore, RT-Thread, and Linux |

Publications

  • Uncovering Nested Data Parallelism and Data Reuse in DNN Computation with FractalTensor. Siran Liu*, Chengxiang Qi*, Ying Cao, Chao Yang, Weifang Hu, Xuanhua Shi, Fan Yang, Mao Yang. ACM SIGOPS 30th Symposium on Operating Systems Principles (SOSP'24) · (*equal contribution) [Paper] [Code]

  • Design and Implementation of a RISC-V-Based Type-1 Hypervisor. Chengxiang Qi. Bachelor Thesis, Tianjin University · (Outstanding Thesis Award) [Code]


Talks

  • Hypocaust: a RISC-V Type-1 Hypervisor Written in Rust · OS2ATC 2022, Beijing (March 2023). Presentation on the design and implementation of a RISC-V Type-1 hypervisor, showcasing virtualization techniques and system-level Rust programming.

Pinned

  1. TiledTensor/TiledCUDA (Public)

    We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstra…

    C++ · 192 stars · 11 forks

  2. microsoft/TileFusion (Public)

    TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.

    Cuda · 106 stars · 7 forks

  3. Ko-oK-OS/xv6-rust (Public)

    🦀️ Reimplement xv6-riscv in Rust!

    Rust · 353 stars · 35 forks

  4. arceos-org/arceos (Public)

    An experimental modular OS written in Rust.

    Rust · 735 stars · 426 forks

  5. hypercraft (Public)

    hypercraft is a VMM library written in Rust.

    Rust · 54 stars · 17 forks

  6. NVSHMEM-Tutorial (Public)

    NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer

    Cuda · 163 stars · 14 forks