Tabmemcheck is an open-source Python library to test language models for memorization of tabular datasets.
The package provides four different tests for verbatim memorization of a tabular dataset (header test, row completion test, feature completion test, first token test).
It also provides additional heuristics to test what an LLM knows about a tabular dataset (feature names test, feature values test, dataset name test, and sampling).
Features:
- [x] Test GPT-3.5, GPT-4, and other LLMs for prior exposure to tabular datasets.
- [x] Supports chat models and (base) language models. In chat mode, we use few-shot learning to condition the model on the desired behavior.
- [x] The package is based entirely on prompts.
- [x] The submodule ``tabmemcheck.datasets`` loads popular tabular datasets in perturbed form (``original``, ``perturbed``, ``task``, and ``statistical``), as used in our COLM'24 [paper](https://arxiv.org/abs/2404.06209); see the sketch after this list.
- [x] The [code](https://github.com/interpretml/LLM-Tabular-Memorization-Checker/tree/main/colm-2024-paper-code) to replicate the COLM'24 paper shows how to perform few-shot learning with LLMs and tabular data.
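A minimal sketch of how the perturbed versions of a dataset might be loaded. The loader name ``load_dataset`` and its keyword argument are assumptions made for illustration, not the package's documented API; see the ``tabmemcheck.datasets`` submodule for the actual dataset-loading functions.

```python
import tabmemcheck.datasets as datasets

# Hypothetical loader: the function name and keyword below are assumptions
# made for illustration; the submodule provides the actual loaders.
for version in ["original", "perturbed", "task", "statistical"]:
    df = datasets.load_dataset("iris.csv", version=version)
    print(version, df.shape)
```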
The different memorization tests were first described in a NeurIPS'23 workshop [paper](https://arxiv.org/abs/2403.06644).
To see what can be done with this package, take a look at our COLM'24 [paper](https://arxiv.org/abs/2404.06209) *"Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models"*. The code to replicate the results in the paper is [here](https://github.com/interpretml/LLM-Tabular-Memorization-Checker/tree/main/colm-2024-paper-code).
The API reference is available [here](http://interpret.ml/LLM-Tabular-Memorization-Checker/api_reference.html).
There are example notebooks for [traditional tabular datasets](https://github.com/interpretml/LLM-Tabular-Memorization-Checker/blob/main/examples/tabular-datasets.ipynb) and the datasets used in OpenAI's [MLE-bench](https://github.com/interpretml/LLM-Tabular-Memorization-Checker/blob/main/examples/MLE-bench-contamination.ipynb) [paper](https://arxiv.org/abs/2410.07095).
### Installation
```
pip install tabmemcheck
```
Then use ```import tabmemcheck``` to import the Python package.
# Tests for Verbatim Memorization
The header test asks the LLM to complete the initial rows of a CSV file.
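As a minimal sketch, assuming the header test follows the same calling pattern as ``feature_names_test`` shown further below (the function name ``header_test`` is taken from the list of tests above, but check the API reference for the exact signature):

```python
import tabmemcheck

# Ask the LLM to complete the initial rows of a (placeholder) CSV file.
# The calling pattern mirrors feature_names_test; see the API reference
# if the actual signature differs.
tabmemcheck.header_test('iris.csv', 'gpt-4o-2024-08-06')
```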
There is also a simple way to run all the different tests and generate a small report.
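A sketch of what this might look like; the entry point ``run_all_tests`` is an assumption here, so consult the API reference for the exact function:

```python
import tabmemcheck

# Run the header, row completion, feature completion, and first token tests
# on a placeholder CSV file and print a small report (entry point assumed).
tabmemcheck.run_all_tests('iris.csv', 'gpt-4o-2024-08-06')
```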
The feature names test asks the LLM to complete the feature names of a dataset.
```python
tabmemcheck.feature_names_test('Kaggle Tabular Playground Series Dec 2021.csv.csv', 'gpt-4o-2024-08-06')
```
# How should the results of the tests be interpreted?
Because one needs to weigh the completions of the LLM against the entropy in the data, interpreting the results of some of the tests requires care.
While this all sounds very complex, the practical evidence for memorization is often very clear. This can also be seen in the examples above.
# How do the tests work?
We use few-shot learning to condition chat models on the desired task. This works well for GPT-3.5, GPT-4, and many other LLMs, but not necessarily for all of them.
You can set ```tabmemcheck.config.print_prompts = True``` to see the prompts.
You can set ```tabmemcheck.config.print_responses = True``` to print the LLM responses, a useful sanity check.
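For example, to inspect both the prompts and the responses while running one of the tests:

```python
import tabmemcheck

# Print the prompts sent to the LLM and its raw responses (useful sanity checks).
tabmemcheck.config.print_prompts = True
tabmemcheck.config.print_responses = True

tabmemcheck.feature_names_test('Kaggle Tabular Playground Series Dec 2021.csv.csv', 'gpt-4o-2024-08-06')
```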
# Can I use this package to write my own tests?
Yes. The module [chat_completion.py](https://github.com/interpretml/LLM-Tabular-Memorization-Checker/blob/main/tabmemcheck/chat_completion.py) provides the general-purpose function ```prefix_suffix_chat_completion```, which is used to implement most of the different tests.
You can see how ```prefix_suffix_chat_completion``` is used by reading the implementations of the different tests in [functions.py](https://github.com/interpretml/LLM-Tabular-Memorization-Checker/blob/main/tabmemcheck/functions.py).
We also provide the general-purpose function ```chat_completion```, which again relies on ```prefix_suffix_chat_completion```.
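As a rough sketch of how a custom test could be built on top of ``prefix_suffix_chat_completion``: the keyword arguments below are illustrative assumptions, not the documented signature, so check [chat_completion.py](https://github.com/interpretml/LLM-Tabular-Memorization-Checker/blob/main/tabmemcheck/chat_completion.py) for the actual arguments.

```python
import pandas as pd
from tabmemcheck.chat_completion import prefix_suffix_chat_completion

# Split every row of a (placeholder) CSV file into a prefix and a suffix:
# the LLM is shown the prefix and asked to complete the suffix, which is the
# core idea behind the row completion and feature completion tests.
df = pd.read_csv('iris.csv')
rows = [', '.join(str(v) for v in row) for row in df.itertuples(index=False)]
prefixes = [row[: len(row) // 2] for row in rows]
suffixes = [row[len(row) // 2:] for row in rows]

# NOTE: the argument names below are assumptions made for illustration only;
# consult chat_completion.py for the actual signature.
responses = prefix_suffix_chat_completion(
    'gpt-4o-2024-08-06',
    prefixes,
    suffixes,
    system_prompt='Complete the row of the CSV file.',
)
```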