This basic example shows how to optimize a simple PyTorch function for speedup.

For more advanced examples, including **[Metal/MLX](/examples/metal/README.md), [Triton](/examples/triton/README.md), [CUDA kernel optimization](/examples/cuda/README.md)**, and **[ML model optimization](/examples/spaceship-titanic/README.md)**, see the `README.md` files within the corresponding subdirectories under the [`examples/`](./examples/) folder.
```bash
  --additional-instructions "Fuse operations in the forward method while ensuring the max float deviation remains small. Maintain the same format of the code."
```
Note that if you have an NVIDIA GPU, change the device to `cuda`. If you are running this on Apple Silicon, set it to `mps`.
**Example 2: Optimizing MLX operations with instructions from a file**

Let's optimize a 2D convolution operation in [`mlx`](https://github.com/ml-explore/mlx) using [Metal](https://developer.apple.com/documentation/metal/). Sometimes additional context or instructions are too complex for a single command-line string; in that case, you can provide a path to a file containing them.
Given how useful causal multi-head self-attention is to transformers, it has seen wide adoption across ML engineering and AI research. It's great to keep things at a high level (in PyTorch) when doing research, but when moving to production you often need to write highly customized low-level kernels to make things run as fast as possible. The `weco` CLI can optimize kernels across a variety of abstraction levels and frameworks. Example 2 uses Metal, but let's explore two more frameworks:
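To make the starting point concrete, here is a minimal NumPy sketch of the causal multi-head self-attention computation being discussed. Function names, weight shapes, and the fused QKV layout are illustrative assumptions, not the actual code in any of the example directories:

```python
import numpy as np

def causal_self_attention(x, w_qkv, w_out, n_heads):
    """Naive causal multi-head self-attention (illustrative sketch only)."""
    T, d = x.shape
    hd = d // n_heads
    qkv = x @ w_qkv                                    # (T, 3d): fused QKV projection
    q, k, v = np.split(qkv, 3, axis=-1)
    # Split the feature dimension into heads: (n_heads, T, hd)
    q = q.reshape(T, n_heads, hd).transpose(1, 0, 2)
    k = k.reshape(T, n_heads, hd).transpose(1, 0, 2)
    v = v.reshape(T, n_heads, hd).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(hd)    # (n_heads, T, T)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # mask out future positions
    scores = np.where(mask, -np.inf, scores)
    scores -= scores.max(axis=-1, keepdims=True)       # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    out = (weights @ v).transpose(1, 0, 2).reshape(T, d)
    return out @ w_out
```

Kernel-level optimization of this routine typically fuses the projection, masking, softmax, and output matmul steps that this sketch performs one at a time.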
This example demonstrates optimizing a script for a Kaggle competition ([Spaceship Titanic](https://www.kaggle.com/competitions/spaceship-titanic/overview)) to improve classification accuracy. The additional instructions are provided via a separate file (`examples/spaceship-titanic/README.md`).

First, install the requirements for the example environment:
# Example: Optimizing PyTorch Self-Attention with CUDA
This example showcases using Weco to optimize a PyTorch causal multi-head self-attention implementation by generating custom [CUDA](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html) kernels. This approach targets low-level optimization beyond standard PyTorch (or even Triton), aiming for higher performance on NVIDIA GPUs.
This example uses a separate Markdown file (`guide.md`) to provide detailed instructions and context to the LLM.
## Setup
1. Ensure you are in the `examples/cuda` directory.
2. Install the required dependency:
   ```bash
   pip install torch
   ```
*(Note: This example requires a compatible NVIDIA GPU and the CUDA Toolkit installed on your system for compiling and running the generated CUDA code.)*
## Optimization Command
Run the following command to start the optimization process:
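The command itself was lost in this copy of the README; reassembled from the flags documented in the bullet list that follows, it would look roughly like this (the `weco` entry point and flag ordering are assumptions — check the upstream `examples/cuda/README.md` for the verbatim command):

```shell
weco --source optimize.py \
  --eval-command "python evaluate.py --solution-path optimize.py" \
  --metric speedup \
  --maximize true \
  --steps 30 \
  --model gemini-2.5-pro-exp-03-25 \
  --additional-instructions guide.md
```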
* `--source optimize.py`: The initial PyTorch self-attention code to be optimized with CUDA.
* `--eval-command "python evaluate.py --solution-path optimize.py"`: Runs the evaluation script, which compiles (if necessary) and benchmarks the CUDA-enhanced code in `optimize.py` against a baseline, printing the `speedup`.
* `--metric speedup`: The optimization target metric.
* `--maximize true`: Weco aims to increase the speedup.
* `--steps 30`: The number of optimization iterations.
* `--model gemini-2.5-pro-exp-03-25`: The LLM used for code generation.
* `--additional-instructions guide.md`: Points Weco to a file containing detailed instructions for the LLM on how to write the CUDA kernels, handle compilation (e.g., using `torch.utils.cpp_extension`), manage data types, and ensure correctness.
Weco will iteratively modify `optimize.py`, potentially generating and integrating CUDA C++ code, guided by the evaluation results and the instructions in `guide.md`.
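The `evaluate.py` script is not reproduced here; a minimal sketch of what such a harness typically measures — wall-clock speedup plus the maximum float deviation against a baseline, as the instructions in this README request — might look like the following. All names and thresholds are assumptions, not the example's actual code:

```python
import time
import numpy as np

def benchmark(fn, x, iters=50):
    """Warm up once, then return (output, mean seconds per call)."""
    fn(x)
    start = time.perf_counter()
    for _ in range(iters):
        out = fn(x)
    return out, (time.perf_counter() - start) / iters

def evaluate(baseline_fn, optimized_fn, x):
    ref, t_ref = benchmark(baseline_fn, x)
    out, t_opt = benchmark(optimized_fn, x)
    # Correctness: largest element-wise deviation from the baseline output.
    max_dev = float(np.max(np.abs(ref - out)))
    # The metric Weco maximizes: baseline time over optimized time.
    speedup = t_ref / t_opt
    print(f"max float deviation: {max_dev:.2e}")
    print(f"speedup: {speedup:.2f}")
    return speedup, max_dev
```

A harness like this lets the optimizer reject candidates whose outputs drift too far from the baseline while still rewarding faster ones.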