Commit a10b9a9

Author: Griffin Adams

Add attention loss to README.

1 parent 9bbca97

File tree

4 files changed: +74 −1 lines

README.md

Lines changed: 7 additions & 1 deletion
@@ -249,7 +249,7 @@ To better understand why one method may work better than another, it is importan
Specifically, it’s nice to be able to understand the deviation from full attention caused by token dropping. As defined in [L2-Norm](https://arxiv.org/abs/2406.11430) and [FastGen](https://arxiv.org/abs/2310.01801), we compute the attention loss as the sum of the attention probabilities for the evicted tokens.

-![Attention Loss Diagram](images/AttentionLoss.png)
+![Attention Loss Diagram](images/attention_loss_concept.png)

To calculate the **Attention Loss**, we need to keep all tokens in the KVCache, e.g., set the cache strategy to `full`, while simulating evictions for a compressed cache.
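To make the definition above concrete, here is a minimal sketch of the per-step computation, assuming you already have the attention probabilities from the full (uncompressed) cache and a boolean mask marking which positions a compressed cache would have evicted. The array names, shapes, and the mean over heads are illustrative choices, not the repo's implementation:

```python
import numpy as np

def attention_loss(attn_probs: np.ndarray, evicted_mask: np.ndarray) -> float:
    """Attention mass that the full model places on evicted tokens.

    attn_probs:   [num_heads, seq_len] attention probabilities for the current
                  query position, computed with the full (uncompressed) KV cache.
    evicted_mask: [seq_len] boolean array, True for tokens a compressed cache
                  would have evicted.
    """
    # Sum the probability assigned to evicted positions, then average over heads
    # (how to aggregate across heads/layers is a design choice, not fixed here).
    return float(attn_probs[:, evicted_mask].sum(axis=-1).mean())

# Toy usage with random numbers (not real model outputs):
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 128))                               # 8 heads, 128 cached tokens
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
evicted = np.zeros(128, dtype=bool)
evicted[:64] = True                                              # pretend the oldest 64 tokens were evicted
print(attention_loss(probs, evicted))
```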

@@ -265,6 +265,12 @@ A handful of debugging experiments can be kicked off by running:
bash experiments/attention_loss.sh
```

+These experiments record Attention Loss at various decoding steps. Because they also record perplexity (PPL) on the [PG-19 Book Corpus](https://github.com/google-deepmind/pg19), we can show a clear correlation between Attention Loss and downstream performance (PPL).
+
+![Attention Loss Results](images/attention_loss_pg19.png)
+
+This suggests that **Attention Loss** might be a decent proxy for approximating downstream degradation from compression.
+
## Extending Cold Compress

### Adding a new Cache Strategy
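As a rough check of the correlation claimed in the README additions above, the two quantities can be correlated directly from the metrics CSV that the plotting script below reads. The CSV path and column names are taken from `charts/attention_loss.py`; this is an illustrative sketch, not part of the repo's experiment pipeline:

```python
import pandas as pd

# Correlate Attention Loss with the perplexity degradation it is meant to predict.
df = pd.read_csv("/workspace/attention_loss.csv")

# Column prefixes follow charts/attention_loss.py (low / medium / high compression).
for level in ("25", "50", "75"):
    r = df[f"{level}_attention_loss"].corr(df[f"{level}_ppl_delta"])  # Pearson correlation
    print(f"{level}: Pearson r = {r:.3f}")
```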

charts/attention_loss.py

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


if __name__ == "__main__":
    # Load per-step attention loss and perplexity-delta metrics
    df = pd.read_csv("/workspace/attention_loss.csv")

    decoding_steps = np.arange(500, 8500, 500)
    n = len(decoding_steps)
    models = ['Low Compression', 'Medium Compression', 'High Compression']

    # Attention loss at each decoding step for each compression level
    attention_loss = {
        'Low Compression': df["25_attention_loss"][:n],
        'Medium Compression': df["50_attention_loss"][:n],
        'High Compression': df["75_attention_loss"][:n],
    }

    # Perplexity delta (compressed vs. full cache) at each decoding step
    ppl_delta = {
        'Low Compression': df["25_ppl_delta"][:n],
        'Medium Compression': df["50_ppl_delta"][:n],
        'High Compression': df["75_ppl_delta"][:n],
    }

    # Create the plot
    plt.rcParams.update({'font.size': 20})

    fig, ax1 = plt.subplots(figsize=(20, 10))

    # Colors for each compression level
    colors = ["#006AA7", '#16a085', '#8e44ad', '#d35400']

    # Plot Attention Loss (solid lines, left y-axis)
    for model, color in zip(models, colors):
        ax1.plot(decoding_steps, attention_loss[model], color=color, label=f'{model} (Attention Loss)', linewidth=6)
        ax1.scatter(decoding_steps, attention_loss[model], color=color, s=400)

    ax1.set_xlabel('Decoding Steps', fontsize=32)
    ax1.set_ylabel('Attention Loss', fontsize=32)

    ax1.tick_params(axis='y', labelsize=32)
    ax1.tick_params(axis='x', labelsize=32)

    # Create a second y-axis for the perplexity delta
    ax2 = ax1.twinx()

    # Plot Perplexity Delta (dashed lines, right y-axis)
    for model, color in zip(models, colors):
        ax2.plot(decoding_steps, ppl_delta[model], color=color, linestyle='--', label=f'{model} (PPL Δ)', linewidth=6)
        ax2.scatter(decoding_steps, ppl_delta[model], color=color, marker='s', s=400)

    ax2.set_ylabel("Perplexity Delta (PPL Δ)", fontsize=32)
    ax2.tick_params(axis="y", labelsize=32)

    # Combine the legends from both axes
    lines1, labels1 = ax1.get_legend_handles_labels()
    lines2, labels2 = ax2.get_legend_handles_labels()
    ax1.legend(lines1 + lines2, labels1 + labels2, loc='upper left', bbox_to_anchor=(0.05, 0.95), borderaxespad=0.25, fontsize=24)

    plt.title("Attention Loss & Perplexity vs Decoding Steps", fontsize=32)
    plt.grid(True)
    plt.tight_layout()
    plt.savefig("/workspace/cold-compress/charts/attention_loss.png")
```
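To smoke-test the plotting script without running the full experiments, a placeholder CSV with the expected columns is enough. The column names mirror the script above and the values are random, purely for checking that the chart renders:

```python
import numpy as np
import pandas as pd

# Write a dummy metrics file with the columns charts/attention_loss.py expects.
rng = np.random.default_rng(0)
steps = np.arange(500, 8500, 500)  # the script plots 16 decoding steps
df = pd.DataFrame({
    f"{level}_{metric}": rng.random(len(steps))
    for level in ("25", "50", "75")
    for metric in ("attention_loss", "ppl_delta")
})
df.to_csv("/workspace/attention_loss.csv", index=False)
```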
images/attention_loss_concept.png

File renamed without changes (previously images/AttentionLoss.png).

images/attention_loss_pg19.png

217 KB
