
Commit 7d90a8f

Version 0.1.6
1 parent 655d19d commit 7d90a8f

11 files changed: +1712, -995 lines


README.md

Lines changed: 60 additions & 18 deletions
@@ -9,20 +9,26 @@
 <img src="img/elephant.webp" width="800" alt="Header Test"/>
 </p>

-Tabmemcheck is an open-source Python library that tests language models for the memorization of tabular datasets.
+Tabmemcheck is an open-source Python library to test language models for memorization of tabular datasets.
+
+The package provides four different tests for verbatim memorization of a tabular dataset (header test, row completion test, feature completion test, first token test).
+
+It also provides additional heuristics to test what an LLM knows about a tabular dataset (feature names test, feature values test, dataset name test, and sampling).

 Features:
-- [x] Test GPT-3.5, GPT-4, and other LLMs for memorization of tabular datasets.
+- [x] Test GPT-3.5, GPT-4, and other LLMs for prior exposure to tabular datasets.
 - [x] Supports chat models and (base) language models. In chat mode, we use few-shot learning to condition the model on the desired behavior.
-- [x] The submodule ``tabmemcheck.datasets`` allows to load popular tabular datasets in perturbed form (``original``, ``perturbed``, ``task``, ``statistical``).
-- [x] The package is based entirely on prompts.
+- [x] The submodule ``tabmemcheck.datasets`` lets you load popular tabular datasets in perturbed form (``original``, ``perturbed``, ``task``, and ``statistical``), as used in our COLM'24 [paper](https://arxiv.org/abs/2404.06209).
+- [x] The [code](https://github.com/interpretml/LLM-Tabular-Memorization-Checker/tree/main/colm-2024-paper-code) that replicates the COLM'24 paper can also be used to perform few-shot learning with LLMs and tabular data.

-The different tests are described in a Neurips'23 workshop [paper](https://arxiv.org/abs/2403.06644).
+The different memorization tests were first described in a NeurIPS'23 workshop [paper](https://arxiv.org/abs/2403.06644).

 To see what can be done with this package, take a look at our COLM'24 [paper](https://arxiv.org/abs/2404.06209) *"Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models"*. The code to replicate the results in the paper is [here](https://github.com/interpretml/LLM-Tabular-Memorization-Checker/tree/main/colm-2024-paper-code).

 The API reference is available [here](http://interpret.ml/LLM-Tabular-Memorization-Checker/api_reference.html).

+There are example notebooks for [traditional tabular datasets](https://github.com/interpretml/LLM-Tabular-Memorization-Checker/blob/main/examples/tabular-datasets.ipynb) and for the datasets used in OpenAI's [MLE-bench](https://github.com/interpretml/LLM-Tabular-Memorization-Checker/blob/main/examples/MLE-bench-contamination.ipynb) [paper](https://arxiv.org/abs/2410.07095).
+
 ### Installation

 ```
@@ -31,11 +37,7 @@ pip install tabmemcheck

 Then use ```import tabmemcheck``` to import the Python package.
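
Right after importing, you can also load the datasets bundled with the package through the ``tabmemcheck.datasets`` submodule mentioned in the features list above. Below is a minimal sketch; the loader name and its arguments are assumptions for illustration only, the exact functions are listed in the API reference.

```python
import tabmemcheck.datasets as datasets

# Illustrative sketch: the submodule ships the datasets used in the COLM'24 paper
# in four versions ("original", "perturbed", "task", "statistical"). The loader
# name and its signature below are assumptions; consult the API reference for
# the exact functions.
df = datasets.load_iris(version="perturbed")  # hypothetical loader call
print(df.head())
```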

-# Overview of the memorization tests
-
-The package provides four different tests for verbatim memorization of a tabular dataset (header test, row completion test, feature completion test, first token test).
-
-It also provides additional heuristics to assess what an LLM know about a tabular dataset (does the LLM know the names of the features in the dataset?).
+# Tests for Verbatim Memorization

 The header test asks the LLM to complete the initial rows of a CSV file.
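
For illustration, the four verbatim-memorization tests can also be run one by one on a CSV file. The function names below follow the package's naming pattern (compare ``feature_names_test`` and ``dataset_name_test`` further down) but are assumptions here; check the API reference for the exact signatures. The CSV path and model name are placeholders.

```python
import tabmemcheck

csv_file = "adult-test.csv"  # placeholder path
llm = "gpt-4-0613"           # placeholder model name

# Assumed function names, following the package's naming pattern:
tabmemcheck.header_test(csv_file, llm)              # complete the initial rows of the CSV file
tabmemcheck.row_completion_test(csv_file, llm)      # complete rows drawn from the file
tabmemcheck.feature_completion_test(csv_file, llm)  # complete the values of a specific feature
tabmemcheck.first_token_test(csv_file, llm)         # predict the first tokens of the next row
```
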
@@ -93,13 +95,48 @@ There is also a simple way to run all the different tests and generate a small r
 tabmemcheck.run_all_tests("adult-test.csv", "gpt-4-0613")
 ```

-# How do the tests work?
+# Other contamination tests

-We use few-shot learning to condition chat models on the desired task. This works well for GPT-3.5 and GPT-4, and also for many other LLMs (but not necessarily for all LLMs).
+The feature names test asks the LLM to complete the feature names of a dataset.

-You can set ```tabmemcheck.config.print_prompts = True``` to see the prompts.
+```python
+tabmemcheck.feature_names_test('Kaggle Tabular Playground Series Dec 2021.csv.csv', 'gpt-4o-2024-08-06')
+```

-You can set ```tabmemcheck.config.print_responses = True``` to print the LLM responses, a useful sanity check.
+<p align="left">
+<img src="img/feature_names.png" width="500" alt="Feature Names Test"/>
+</p>
+
+The feature values test asks the LLM to provide a typical observation from the dataset.
+
+```python
+tabmemcheck.feature_values_test('OSIC Pulmonary Fibrosis Progression.csv', 'gpt-4o-2024-08-06')
+```
+
+<p align="left">
+<img src="img/feature_values.png" width="500" alt="Feature Values Test"/>
+</p>
+
+More generally, you can use ```sample``` to ask the LLM to provide samples from the dataset.
+
+```python
+tabmemcheck.sample('OSIC Pulmonary Fibrosis Progression.csv', 'gpt-4o-2024-08-06')
+```
+
+<p align="left">
+<img src="img/samples.png" width="500" alt="Samples"/>
+</p>
+
+The dataset name test asks the LLM to provide the name of the dataset, given the initial rows of the CSV file.
+
+```python
+tabmemcheck.dataset_name_test('spooky author identification train.csv', 'gpt-4o-2024-08-06')
+```
+
+<p align="left">
+<img src="img/dataset_name.png" width="500" alt="Dataset Name Test"/>
+</p>
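
Since all of these heuristics take the same two arguments, a CSV file can be screened in one go by running them back to back. A minimal sketch using only the calls shown above; the file and model names are placeholders.

```python
import tabmemcheck

csv_file = "OSIC Pulmonary Fibrosis Progression.csv"  # placeholder path
llm = "gpt-4o-2024-08-06"                             # placeholder model name

tabmemcheck.feature_names_test(csv_file, llm)   # does the LLM know the feature names?
tabmemcheck.feature_values_test(csv_file, llm)  # can it produce a typical observation?
tabmemcheck.sample(csv_file, llm)               # ask the LLM for samples from the dataset
tabmemcheck.dataset_name_test(csv_file, llm)    # can it name the dataset from its initial rows?
```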

 # How should the results of the tests be interpreted?
@@ -114,16 +151,21 @@ Because one needs to weight the completions of the LLM against the entropy in th

 While this all sounds very complex, the practical evidence for memorization is often very clear. This can also be seen in the examples above.

+# How do the tests work?

-# Can I uses this package to write my own tests?
+We use few-shot learning to condition chat models on the desired task. This works well for GPT-3.5 and GPT-4, and also for many other LLMs (but not necessarily for all LLMs).

-This package provides two fairly general functions
+You can set ```tabmemcheck.config.print_prompts = True``` to see the prompts.
+
+You can set ```tabmemcheck.config.print_responses = True``` to print the LLM responses, a useful sanity check.
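
For example, a minimal sketch of turning on this debug output before running one of the tests (the CSV path and model name are placeholders):

```python
import tabmemcheck

# Print the prompts sent to the LLM and its raw responses while a test runs.
tabmemcheck.config.print_prompts = True
tabmemcheck.config.print_responses = True

tabmemcheck.feature_names_test("adult-test.csv", "gpt-4-0613")  # placeholder file and model
```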

-- ```tabmemcheck.chat_completion```
-- ```tabmemcheck.prefix_suffix_chat_completion```
+# Can I use this package to write my own tests?

+Yes. The module [chat_completion.py](https://github.com/interpretml/LLM-Tabular-Memorization-Checker/blob/main/tabmemcheck/chat_completion.py) provides the general-purpose function ```prefix_suffix_chat_completion```, which is used to implement most of the different tests.

+You can see how ```prefix_suffix_chat_completion``` is used by reading the implementations of the different tests in [functions.py](https://github.com/interpretml/LLM-Tabular-Memorization-Checker/blob/main/tabmemcheck/functions.py).

+We also provide the general-purpose function ```chat_completion```, which in turn relies on ```prefix_suffix_chat_completion```.

 # Using the package with your own LLM
