Commit d89d99f
Author: Magdalen Manohar
add filter utility and filter file information
1 parent 5afbdc4 commit d89d99f

File tree: 2 files changed (+82, -2 lines)

dataset_preparation/caselaw_dataset.md

Lines changed: 22 additions & 2 deletions
@@ -37,18 +37,30 @@

The metadata is released in jsonl format, with one line per document, and the following fields:

1. "doc_id": the document id based on the document's numeric order in the base file.
2. "case_id": a unique identifier drawn from the original parquet files; can be used to match a case back to its complete metadata and opinion text.
3. "date_filed": the date the case was filed. Originally in YYYY-MM-DD format, we converted dates to ordinals using Python's `datetime` library for ease of use with comparison operators. They can be converted back to the original format using `datetime`'s `fromordinal()` function.
4. "court_jurisdiction": the place of jurisdiction of the court. Typically either a US state or the entire United States.
5. "court_type": the court type as a one- or two-letter abbreviation (appeals, criminal, circuit, and so on).
6. "court_full_name": the full name of the court.
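The ordinal encoding of "date_filed" can be round-tripped with the standard `datetime` module; the specific date below is purely illustrative:

```python
from datetime import date

# "date_filed" values are stored as ordinals; an illustrative round trip:
filed = date.fromisoformat("1987-06-15")   # original YYYY-MM-DD form
ordinal = filed.toordinal()                # form stored in the metadata

# Ordinals order the same way the underlying dates do,
# so date-range predicates reduce to integer comparisons.
assert ordinal < date(1990, 1, 1).toordinal()

# Recover the original format with fromordinal()
assert date.fromordinal(ordinal).isoformat() == "1987-06-15"
```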

Metadata for the base and query sets can be downloaded using the following URLs:

```bash
wget https://comp21storage.z5.web.core.windows.net/caselaw/caselaw_base_metadata.jsonl
wget https://comp21storage.z5.web.core.windows.net/caselaw/caselaw_query_metadata.jsonl
```

In addition to releasing metadata, we also curated a set of filtered queries from the query metadata and computed ground truth with respect to the points satisfying those queries, again using Chamfer distance. The filter queries, as well as groundtruth for the full dataset and for the first 1M and 100K vectors, can be downloaded using the following links:

```bash
wget https://comp21storage.z5.web.core.windows.net/caselaw/caselaw_query_filters.jsonl
wget https://comp21storage.z5.web.core.windows.net/caselaw/caselaw_filtered_gt.bin
wget https://comp21storage.z5.web.core.windows.net/caselaw/caselaw_filtered_gt_1M.bin
wget https://comp21storage.z5.web.core.windows.net/caselaw/caselaw_filtered_gt_100K.bin
```

The filtered groundtruth is calculated for up to the top 100 points, but since some very selective queries may have fewer than 100 points satisfying the filter predicate, we use the range groundtruth format for storing the groundtruth files. It consists of the number of points, followed by the total number of results, the number of results per point, and then the identifiers of the ground truth points. A reader for this format can be found in the function `range_result_read` in `benchmark/dataset_io.py`.

Python functions for reading the jsonl metadata files and checking whether a line satisfies a given filter query are included in `jsonl_filter_utils.py`. Unfortunately, they are not currently compatible with the utilities described in `filter_utils.md`.

## Development

@@ -58,6 +70,14 @@

This section contains details on the development of the dataset which may be useful.

Each legal case was formatted as a JSON string encoding all of its fields, with the "opinion" field at the end. If the total number of tokens was larger than the 8192-token context window, the string was chunked into multiple text strings with a 512-token overlap between chunks. Strings were embedded using OpenAI's [text-embedding-3-small](https://platform.openai.com/docs/models/text-embedding-3-small), with 1536 floating-point dimensions.
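The chunking step can be sketched as a sliding window over a token sequence; the tokenizer and exact boundary handling used for the release are assumptions here:

```python
def chunk_tokens(tokens, window=8192, overlap=512):
    """Split a token sequence into windows with a fixed overlap.

    Illustrative sketch of the chunking scheme described above; the
    actual tokenizer and chunk boundaries used for the release may differ.
    """
    if len(tokens) <= window:
        return [tokens]
    chunks = []
    step = window - overlap  # advance by window minus overlap each time
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # last window already covers the tail
    return chunks
```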

### Filter Curation

Here we provide a brief summary of how a filter was selected for each query. Each of the four nontrivial metadata fields (excluding the document ids) was converted to a predicate. For the fields "court_full_name", "court_jurisdiction", and "court_type", the predicate takes the form of equality to the string value. For "date_filed", a date range around the date of filing was used: for dates prior to 1900, a twenty-year radius around the date filed; for dates between 1900 and 1950, a ten-year radius; and for dates from 1950 to the present, a four-year radius.

Each query was randomly assigned a single predicate with probability one-third or a logical AND of two predicates with probability two-thirds. For the purposes of this document, a date radius query is counted as a single predicate, even though it is technically encoded as an AND of two predicates (less than one date and greater than another). For the single-predicate queries, one of the four fields was chosen at random, with a slight bias towards court name to help keep the average specificity low. For the double-predicate queries, the first field was chosen at random; but since the name of a court implies its type and jurisdiction, if "court_full_name" was selected first we disallowed type and jurisdiction for the second field, and similarly, if type or jurisdiction was selected first, we disallowed name as the second field.

The average specificity (the proportion of base points satisfying a given query) was about 4.5%, with a maximum of 38% and a minimum of 0%. We did not disallow non-satisfiable queries, as they can validly occur in filtering scenarios, but they were empirically very rare.
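To make the encoding concrete, here is a hypothetical two-predicate filter in the `{field: {operator: value}}` form consumed by the utilities in `jsonl_filter_utils.py`. The court type and dates are invented for illustration; note that the date radius expands into an AND of `$gt` and `$lt` over ordinals:

```python
import json
from datetime import date

# Hypothetical filter: court-type equality ANDed with a four-year
# date radius around 1995-06-01, with dates encoded as ordinals.
example_filter = {
    "filter": {
        "$and": [
            {"court_type": {"$eq": "A"}},
            {"date_filed": {"$gt": date(1991, 6, 1).toordinal()}},
            {"date_filed": {"$lt": date(1999, 6, 1).toordinal()}},
        ]
    }
}

# One line of a jsonl filter file would be this object serialized:
line = json.dumps(example_filter)
```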
### Notes

The case with "case_id" 4292693 was omitted from the embeddings as its opinion seemed to consist of thousands of pages of degenerate text. Otherwise, each file from the [COLD Cases release on HuggingFace](https://huggingface.co/datasets/harvard-lil/cold-cases) was embedded and released.
jsonl_filter_utils.py

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@

```python
import json

def read_jsonl_file(jsonl_filename):
    """
    Reads a JSONL file and returns a list of JSON objects.
    """
    data = []
    with open(jsonl_filename, 'r') as f:
        for line in f:
            entry = json.loads(line)
            data.append(entry)
    return data

def is_predicate_satisfied(metadata, predicate):
    """
    Checks whether a metadata dict satisfies a single predicate
    of the form {field: {operator: value}}.
    """
    field, expression = next(iter(predicate.items()))
    operator, predicate_value = next(iter(expression.items()))
    metadata_value = metadata[field]

    if operator == '$eq':
        if metadata_value != predicate_value:
            return False
    elif operator == '$ne':
        if metadata_value == predicate_value:
            return False
    elif operator == '$lt':
        if metadata_value >= predicate_value:
            return False
    elif operator == '$lte':
        if metadata_value > predicate_value:
            return False
    elif operator == '$gt':
        if metadata_value <= predicate_value:
            return False
    elif operator == '$gte':
        if metadata_value < predicate_value:
            return False
    else:
        raise ValueError(f"Unsupported operator: {operator}")

    return True

# Computes whether a given metadata dict satisfies the given filter condition
def is_match(metadata, query_filter):
    match = True
    if '$and' in query_filter['filter']:
        for cond in query_filter['filter']['$and']:
            match = match and is_predicate_satisfied(metadata, cond)
    else:
        match = match and is_predicate_satisfied(metadata, query_filter['filter'])
    return match

# Example usage from files in the caselaw dataset release:

# metadata_file = "caselaw_base_metadata.jsonl"
# query_filter_file = "caselaw_query_filters.jsonl"

# base_metadata = read_jsonl_file(metadata_file)
# query_filters = read_jsonl_file(query_filter_file)

# match_0_0 = is_match(base_metadata[0], query_filters[0])
```
