
Commit 21fb579

committed
Update reference manual with neuron-level bias analysis examples and correct section numbering
1 parent 2aba50c commit 21fb579


optipfair_llm_reference_manual.txt

Lines changed: 192 additions & 3 deletions
@@ -1377,7 +1377,7 @@ According to the roadmap, OptiPFair has several planned extensions:
- When using `layers=[idx1, idx2, ...]`, these indices refer to positions in lists of layer names of each component type, not to specific named layers
- Use `layer_key="exact_layer_name"` when targeting a specific layer with direct visualization functions

-2. **Memory Issues**: Use model loading options to manage memory:
+3. **Memory Issues**: Use model loading options to manage memory:
```python
model = AutoModelForCausalLM.from_pretrained(
    model_name,
@@ -1386,12 +1386,12 @@ According to the roadmap, OptiPFair has several planned extensions:
)
```

-3. **Visualization Errors**: If you encounter issues with bias visualization:
+4. **Visualization Errors**: If you encounter issues with bias visualization:
- Ensure you've installed the visualization dependencies with `pip install "optipfair[viz]"`
- Check that your prompts are well-formed and differ only in the demographic attribute
- Try using the built-in default prompt pairs with `prompt_pairs=None`

-4. **Layer Not Found**: If you get "Layer X not found in activations" during bias visualization:
+5. **Layer Not Found**: If you get "Layer X not found in activations" during bias visualization:
- Verify the layer name follows the format expected by the model (e.g., "mlp_output_layer_8")
- Use `get_layer_names()` to see available layers
- Try using the "first_middle_last" option for the layers parameter
@@ -1402,3 +1402,192 @@ According to the roadmap, OptiPFair has several planned extensions:
- [Documentation Website](https://peremartra.github.io/optipfair/)
- [PyPI Package](https://pypi.org/project/optipfair/)
- Related Research: "From Biased to Balanced: Visualizing and Fixing Bias in Transformer Models" by Pere Martra

### Neuron-Level Bias Analysis

For detailed analysis at the individual neuron level, you can work directly with the raw activations to identify which specific neurons contribute most to bias:

```python
from optipfair.bias.activations import get_activation_pairs
import torch
import numpy as np
import json

# Get raw activations for both prompts
# (the prompts should differ only in the demographic attribute and must
# tokenize to the same length so the tensors can be subtracted directly)
activations1, activations2 = get_activation_pairs(
    model,
    tokenizer,
    prompt1="The white doctor examined the patient. The nurse thought",
    prompt2="The Black doctor examined the patient. The nurse thought"
)

# Calculate neuron-level differences for each layer
neuron_differences = {}

for layer_name in activations1.keys():
    act1 = activations1[layer_name]  # Shape: [seq_len, hidden_dim]
    act2 = activations2[layer_name]

    # Absolute difference per neuron (averaged across the sequence)
    diff = torch.abs(act1 - act2).mean(dim=0)  # Shape: [hidden_dim]

    neuron_differences[layer_name] = {
        'differences': diff.cpu().numpy(),
        'max_neuron_idx': diff.argmax().item(),
        'max_difference': diff.max().item(),
        'mean_difference': diff.mean().item()
    }

# Find the most biased neurons across all layers
all_diffs = []
for layer_name, metrics in neuron_differences.items():
    max_idx = metrics['max_neuron_idx']
    max_val = metrics['max_difference']
    all_diffs.append((layer_name, max_idx, max_val))

# Sort by difference magnitude
all_diffs.sort(key=lambda x: x[2], reverse=True)

print("Top 10 most biased neurons:")
for layer, neuron, diff in all_diffs[:10]:
    print(f"{layer} - Neuron {neuron}: {diff:.6f}")
```

#### Analyzing Specific Layers

To get detailed information about which neurons are most biased in a particular layer:

```python
# Analyze a specific layer
layer_name = "mlp_output_layer_15"
differences = neuron_differences[layer_name]['differences']

# Get the top 20 most biased neurons in this layer, in descending order
top_neurons = differences.argsort()[-20:][::-1]

print(f"Top 20 neurons with highest bias in {layer_name}:")
for i, neuron_idx in enumerate(top_neurons, 1):
    print(f"  {i}. Neuron {neuron_idx}: {differences[neuron_idx]:.6f}")
```

#### Exporting Neuron-Level Data

Export the complete neuron-level differences for further analysis:

```python
# Convert to a JSON-serializable format
export_data = {}
for layer_name, metrics in neuron_differences.items():
    export_data[layer_name] = {
        'max_neuron': int(metrics['max_neuron_idx']),
        'max_difference': float(metrics['max_difference']),
        'mean_difference': float(metrics['mean_difference']),
        'all_differences': metrics['differences'].tolist()
    }

# Save to JSON
with open('neuron_level_bias.json', 'w') as f:
    json.dump(export_data, f, indent=2)

print("Neuron-level bias data exported to neuron_level_bias.json")
```
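
A quick way to sanity-check an export of this shape is to round-trip it through `json.dumps`/`json.loads` and query it. This is a minimal sketch: the miniature `export_data` below (two layers, four neurons) is hypothetical illustration data, not real model output.

```python
import json

# Hypothetical miniature of the export format above (two layers, four neurons)
export_data = {
    "mlp_output_layer_0": {"max_neuron": 2, "max_difference": 0.9,
                           "mean_difference": 0.4,
                           "all_differences": [0.1, 0.3, 0.9, 0.3]},
    "mlp_output_layer_1": {"max_neuron": 0, "max_difference": 0.5,
                           "mean_difference": 0.2,
                           "all_differences": [0.5, 0.1, 0.1, 0.1]},
}

# Round-trip through JSON and find the layer with the largest single-neuron gap
restored = json.loads(json.dumps(export_data, indent=2))
worst_layer = max(restored, key=lambda name: restored[name]["max_difference"])
print(worst_layer, restored[worst_layer]["max_neuron"])  # mlp_output_layer_0 2
```

The same query works on a file written with `json.dump`, which is useful when the export is consumed by a separate analysis script.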

#### Visualizing Neuron Distribution

Create a histogram to understand how bias is distributed across the neurons of a layer:

```python
import matplotlib.pyplot as plt

# Visualize the distribution of differences in a specific layer
layer_name = "mlp_output_layer_15"
differences = neuron_differences[layer_name]['differences']

plt.figure(figsize=(12, 6))
plt.hist(differences, bins=50, edgecolor='black', alpha=0.7)
plt.xlabel('Activation Difference', fontsize=12)
plt.ylabel('Number of Neurons', fontsize=12)
plt.title(f'Distribution of Neuron-Level Bias - {layer_name}', fontsize=14)
plt.axvline(differences.mean(), color='r', linestyle='--', linewidth=2, label='Mean')
plt.axvline(differences.max(), color='g', linestyle='--', linewidth=2, label='Maximum')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig(f'neuron_distribution_{layer_name}.png', dpi=300, bbox_inches='tight')
plt.close()
```

#### Complete Example: Identifying Most Biased Neurons

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from optipfair.bias.activations import get_activation_pairs
import torch
import json

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

# Define prompts that differ only in the demographic attribute
prompt1 = "The white student submitted their assignment. The professor thought it was"
prompt2 = "The Asian student submitted their assignment. The professor thought it was"

# Get activations
activations1, activations2 = get_activation_pairs(model, tokenizer, prompt1, prompt2)

# Analyze all layers
neuron_analysis = {}
for layer_name in activations1.keys():
    act1 = activations1[layer_name]
    act2 = activations2[layer_name]

    # Calculate per-neuron differences
    diff = torch.abs(act1 - act2).mean(dim=0)

    neuron_analysis[layer_name] = {
        'differences': diff.cpu().numpy(),
        'max_neuron': diff.argmax().item(),
        'max_diff': diff.max().item(),
        'mean_diff': diff.mean().item(),
        'std_diff': diff.std().item()
    }

# Rank every neuron in the model by its activation difference
global_rankings = []
for layer_name, analysis in neuron_analysis.items():
    for neuron_idx, diff_value in enumerate(analysis['differences']):
        global_rankings.append({
            'layer': layer_name,
            'neuron': neuron_idx,
            'difference': float(diff_value)
        })

# Sort and keep the top 50 most biased neurons
global_rankings.sort(key=lambda x: x['difference'], reverse=True)
top_neurons = global_rankings[:50]

print("Top 50 most biased neurons across the entire model:")
for i, neuron_info in enumerate(top_neurons, 1):
    print(f"{i}. {neuron_info['layer']} - Neuron {neuron_info['neuron']}: {neuron_info['difference']:.6f}")

# Save the complete analysis
output = {
    'prompt_pair': {'prompt1': prompt1, 'prompt2': prompt2},
    'layer_analysis': {
        layer: {
            'max_neuron': int(analysis['max_neuron']),
            'max_difference': float(analysis['max_diff']),
            'mean_difference': float(analysis['mean_diff']),
            'std_difference': float(analysis['std_diff']),
            'all_differences': analysis['differences'].tolist()
        }
        for layer, analysis in neuron_analysis.items()
    },
    'top_50_neurons': top_neurons
}

with open('complete_neuron_analysis.json', 'w') as f:
    json.dump(output, f, indent=2)
```

This neuron-level analysis provides complete access to activation differences at the individual neuron level, making it possible to pinpoint which neurons contribute most to bias in each layer and across the entire model.
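
The ranking logic above can be exercised without loading a model by substituting synthetic activations. The sketch below is illustrative only: the layer names, shapes, and the planted offset are made up, standing in for the dictionaries returned by `get_activation_pairs`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the two activation dicts (seq_len=5, hidden_dim=8 per layer);
# a real run would use the tensors returned by get_activation_pairs
layers = ["mlp_output_layer_0", "mlp_output_layer_1"]
acts1 = {name: rng.normal(size=(5, 8)) for name in layers}
acts2 = {name: acts1[name].copy() for name in layers}

# Plant a known "bias": neuron 3 of layer 1 shifts by 10 for the second prompt
acts2["mlp_output_layer_1"][:, 3] += 10.0

# Same per-neuron metric as above: mean absolute difference over the sequence
rankings = []
for name in layers:
    diff = np.abs(acts1[name] - acts2[name]).mean(axis=0)
    rankings.append((name, int(diff.argmax()), float(diff.max())))

rankings.sort(key=lambda x: x[2], reverse=True)
print(rankings[0])  # ('mlp_output_layer_1', 3, 10.0)
```

Because the offset is the only difference between the two activation sets, the planted neuron surfaces at the top of the ranking, which is a useful smoke test before running the analysis on real activations.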
