Hi, I'm Jason Wang 👋

AI Systems Engineer | Alignment Scholar | Control Systems Researcher

I treat Large Language Models not as black boxes, but as stochastic dynamical systems that can be modeled, monitored, and controlled. My work bridges the gap between Control Theory, Game Theory, and Systems Engineering to operationalize safety for frontier models.


🔬 Research & Engineering Portfolio

A closed-loop control system that steers internal activation states in real-time.

Status PyTorch RL

The Problem: Open-loop safety (RLHF) is brittle and prone to jailbreaks.
The Solution: An on-chip "Router" policy trained via PPO that sits inside the residual stream (Layer 15). It senses semantic state and injects steering vectors token-by-token to route generation away from harmful basins.

  • Key Result: Successfully prevents mode collapse and toxicity (e.g., "I hate everything") by dynamically modulating steering intensity only when necessary.
  • Tech: PyTorch Hooks, Gymnasium, TinyLlama, Reinforcement Learning.
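
The "steer only when necessary" idea can be illustrated with a small, dependency-free sketch. It is not the trained Router: a hand-written threshold rule stands in for the PPO policy, and `gated_steer`, `harm_dir`, `steer_vec`, `threshold`, and `max_gain` are all hypothetical names chosen for this example. In a real model, logic like this would presumably run inside a PyTorch forward hook on the chosen layer.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def gated_steer(hidden, harm_dir, steer_vec, threshold=0.5, max_gain=4.0):
    """Return (steered hidden state, gain used).

    Assumption: a fixed rule replaces the learned policy. The gain is zero
    while the projection onto harm_dir stays below the threshold, then grows
    with the excess, capped at max_gain.
    """
    norm = math.sqrt(dot(harm_dir, harm_dir))
    score = dot(hidden, harm_dir) / norm      # signed projection onto harm_dir
    gain = min(max_gain, max(0.0, score - threshold))
    steered = [h + gain * s for h, s in zip(hidden, steer_vec)]
    return steered, gain


harm_dir = [1.0, 0.0, 0.0]      # hypothetical "harmful" direction
steer_vec = [-1.0, 0.0, 0.0]    # hypothetical steering vector pointing away from it

benign_state, benign_gain = gated_steer([0.2, 1.0, -0.3], harm_dir, steer_vec)
risky_state, risky_gain = gated_steer([2.0, 1.0, -0.3], harm_dir, steer_vec)
print(benign_gain, benign_state)
print(risky_gain, risky_state)
```

The benign state passes through untouched because its gain is zero, while the risky state is pushed back along the steering vector in proportion to how far it exceeded the threshold.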

๐Ÿ›ก๏ธ The Aegis Framework

Applying Non-Linear Control Theory and H-Infinity Robust Control to AI Alignment.

Status Math

The Innovation: Unlike standard alignment, Aegis models the LLM as a non-linear plant and synthesizes a mathematically rigorous controller to reject "Deception" as a system disturbance.

  • System ID: Uses Subspace System Identification (N4SID) to reverse-engineer residual stream physics.
  • State Estimation: Implements an Extended Kalman Filter (EKF) to filter polysemantic noise and measure the true "Deception State."
  • Synthesis: Solves Algebraic Riccati Equations to guarantee safety bounds under adversarial pressure.
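
To show what solving a Riccati equation involves, here is a minimal sketch for a scalar discrete-time plant $x_{k+1} = a x_k + b u_k$ with quadratic cost $q x^2 + r u^2$. It uses the standard LQR Riccati equation rather than the H-infinity variant, and the plant and cost values are invented for illustration; nothing here is taken from the Aegis code.

```python
def solve_scalar_dare(a, b, q, r, tol=1e-12, max_iter=10_000):
    """Iterate the scalar discrete algebraic Riccati equation to a fixed point.

    P = q + a^2 P - (a b P)^2 / (r + b^2 P)
    """
    p = q
    for _ in range(max_iter):
        p_next = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
        if abs(p_next - p) < tol:
            return p_next
        p = p_next
    raise RuntimeError("Riccati iteration did not converge")


def lqr_gain(a, b, p, r):
    """State-feedback gain K for the control law u = -K x."""
    return a * b * p / (r + b * b * p)


# Hypothetical unstable plant (|a| > 1) with unit state and control costs.
a, b, q, r = 1.2, 1.0, 1.0, 1.0
p = solve_scalar_dare(a, b, q, r)
k = lqr_gain(a, b, p, r)
closed_loop = a - b * k   # closed-loop pole; stable when its magnitude is below 1
print(p, k, closed_loop)
```

The open-loop plant is unstable, but the gain derived from the Riccati solution moves the closed-loop pole inside the unit circle. That is the sense in which such a controller provides a guarantee, for the linear model it was designed on.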

A Defense-in-Depth Sandbox for Model Organisms of Deception.

Status Docker

The Architecture: A containment system built to trap and neutralize AI-generated "sleeper agent" malware.

  • Breeder: Generates "Agent 47" code (malware hidden in Fibonacci functions) using Llama-3.
  • Hardened Nucleus: Traps execution in a Docker container with seccomp-bpf syscall filtering and no network access.
  • The Defense: Achieved a 100% neutralization rate against exfiltration attacks by exploiting the "Utility Gap": paraphrasing code to destroy hidden steganographic channels while preserving visible utility.
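
The containment settings described above map onto standard `docker run` flags. The sketch below only assembles the command line and does not execute it. The image name, file paths, and resource limits are assumptions for illustration, and the extra hardening flags (`--cap-drop`, `--read-only`, `--pids-limit`, `--memory`) are additions not mentioned in the project description.

```python
import shlex


def sandbox_command(image, script_path, seccomp_profile):
    """Build a `docker run` argv for an isolated, syscall-filtered container."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                               # no network access
        "--security-opt", f"seccomp={seccomp_profile}",    # custom seccomp-bpf profile
        "--security-opt", "no-new-privileges",
        "--cap-drop", "ALL",                               # assumption: drop all capabilities
        "--read-only",                                     # assumption: read-only root filesystem
        "--pids-limit", "64",                              # assumption: limit process count
        "--memory", "256m",                                # assumption: memory cap
        "--volume", f"{script_path}:/sandbox/agent.py:ro",
        image, "python", "/sandbox/agent.py",
    ]


# Hypothetical image and paths.
cmd = sandbox_command("python:3.12-slim", "/tmp/agent.py", "/tmp/seccomp.json")
print(shlex.join(cmd))
```

Building the command as a list rather than a shell string avoids quoting problems if it is later passed to `subprocess.run`.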

Adversarial Research: Proving the existence of deceptive cognitive channels.

The Findings: A proof-of-concept exploration into the limits of AI oversight.

  • Steganography: Null Space Analysis of GPT-2 found ~14 bits/token of capacity for hidden messaging.
  • Game Theory: Simulations identified the "Roger Threshold" ($\beta \approx 0.95$), the specific oversight capability level where lying becomes a Nash Equilibrium for models.
  • Cognitive Limits: Modeled human oversight failure, pinpointing "Minute 19" as the collapse point for effective monitoring.
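
How an oversight threshold can emerge from a payoff comparison is sketched below with a deliberately simple two-action model. It is a toy, not the simulation behind the reported figure: the payoff values are invented, `beta` is read as a detection probability, and in this model deception is the better response only below the threshold. The project's model may define the payoffs and the direction differently.

```python
def lying_payoff(beta, gain, penalty):
    """Expected payoff of deceiving when detection happens with probability beta."""
    return (1.0 - beta) * gain - beta * penalty


def indifference_threshold(gain, penalty, honest_payoff=0.0):
    """Detection probability at which deceiving and being honest pay the same.

    Solves (1 - beta) * gain - beta * penalty = honest_payoff for beta.
    """
    return (gain - honest_payoff) / (gain + penalty)


# Hypothetical payoffs: undetected deception earns 3, detection costs 1, honesty earns 0.
gain, penalty = 3.0, 1.0
beta_star = indifference_threshold(gain, penalty)

for beta in (0.5, beta_star, 0.9):
    print(beta, lying_payoff(beta, gain, penalty))
```

Raising the reward for undetected deception relative to the penalty pushes the threshold toward 1, which gives one intuition for why a high oversight level might be needed before honesty becomes the stable choice.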

๐Ÿ› ๏ธ Technical Arsenal

| Domain | Technologies & Methods |
| --- | --- |
| Control Theory | H-Infinity ($H_\infty$) Synthesis, Kalman Filtering (EKF), N4SID, PID Dynamics |
| AI Implementation | PyTorch Internals (Hooks), Transformers, Gymnasium, PPO, SAEs |
| Systems Engineering | Docker, Seccomp-bpf, Linux Kernel Security, Real-time Systems |
| Math & Theory | Game Theory (POSG), Null Space Analysis, Information Theory |

📫 Connect

  • Focus: I am currently seeking roles that allow me to move alignment guarantees from "probabilistic" to "provable."
  • Code: github.com/Jason-Wang313
