Commit e2d9b6b

WIP improve legislation embeddings
1 parent 608d91e commit e2d9b6b

24 files changed, +6326 -662 lines changed
Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
+# How to Create a Pull Request Using GitHub CLI
+
+This guide explains how to create pull requests using GitHub CLI in our project.
+
+**Important**: All PR titles and descriptions should be written in English.
+
+## Prerequisites
+
+1. Install GitHub CLI if you haven't already:
+
+   ```bash
+   # macOS
+   brew install gh
+
+   # Windows
+   winget install --id GitHub.cli
+
+   # Linux
+   # Follow instructions at https://github.com/cli/cli/blob/trunk/docs/install_linux.md
+   ```
+
+2. Authenticate with GitHub:
+   ```bash
+   gh auth login
+   ```
+
+## Creating a New Pull Request
+
+1. First, prepare your PR description following the template in @.github/pull_request_template.md
+
+2. Use the `gh pr create --draft` command to create a new pull request:
+
+   ```bash
+   # Create PR with proper template structure
+   gh pr create --draft --title "✨(scope): Your descriptive title" --body-file .github/pull_request_template.md --base main
+   ```
+
+## Best Practices
+
+1. **Language**: Always use English for PR titles and descriptions
+
+2. **PR Title Format**: Use conventional commit format with emojis
+
+   - Always include an appropriate emoji at the beginning of the title
+   - Use the actual emoji character (not the code representation like `:sparkles:`)
+   - Examples:
+     - `✨(supabase): Add staging remote configuration`
+     - `🐛(auth): Fix login redirect issue`
+     - `📝(readme): Update installation instructions`
+
+3. **Description Template**: Always use our PR template structure from @.github/pull_request_template.md.
+
+4. **Template Accuracy**: Ensure your PR description precisely follows the template structure:
+
+   - Don't modify or rename the PR-Agent sections (`pr_agent:summary` and `pr_agent:walkthrough`)
+   - Keep all section headers exactly as they appear in the template
+   - Don't add custom sections that aren't in the template
+
+5. **Draft PRs**: Start as draft when the work is in progress
+   - Use `--draft` flag in the command
+   - Convert to ready for review when complete using `gh pr ready`
+
+### Common Mistakes to Avoid
+
+1. **Using Non-English Text**: All PR content must be in English
+2. **Incorrect Section Headers**: Always use the exact section headers from the template
+3. **Adding Custom Sections**: Stick to the sections defined in the template
+4. **Using Outdated Templates**: Always refer to the current @.github/pull_request_template.md file
+
+### Missing Sections
+
+Always include all template sections, even if some are marked as "N/A" or "None".
+
+## Additional GitHub CLI PR Commands
+
+Here are some additional useful GitHub CLI commands for managing PRs:
+
+```bash
+# List your open pull requests
+gh pr list --author "@me"
+
+# Check PR status
+gh pr status
+
+# View a specific PR
+gh pr view <PR-NUMBER>
+
+# Check out a PR branch locally
+gh pr checkout <PR-NUMBER>
+
+# Convert a draft PR to ready for review
+gh pr ready <PR-NUMBER>
+
+# Add reviewers to a PR
+gh pr edit <PR-NUMBER> --add-reviewer username1,username2
+
+# Merge a PR
+gh pr merge <PR-NUMBER> --squash
+```
+
+## Using Templates for PR Creation
+
+To simplify PR creation with consistent descriptions, you can create a template file:
+
+1. Create a file named `pr-template.md` with your PR template
+2. Use it when creating PRs:
+
+```bash
+gh pr create --draft --title "✨(scope): Your title" --body-file pr-template.md --base main
+```
+
+## Related Documentation
+
+- [PR Template](.github/pull_request_template.md)
+- [Conventional Commits](https://www.conventionalcommits.org/)
+- [GitHub CLI documentation](https://cli.github.com/manual/)
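
The title convention in the guide (a literal emoji, a scope in parentheses, a colon, then a description) can be checked mechanically. The following is a minimal sketch and not part of this commit: the name `isValidPrTitle`, the pattern, and the allowed scope characters are all assumptions, and it only accepts a single-code-point emoji (optionally followed by a variation selector).

```typescript
// Hypothetical helper (not part of this commit): checks a PR title against
// the "<emoji>(<scope>): <description>" convention described in the guide.
// Assumption: scopes consist of lowercase letters, digits, and hyphens.
// Limitation: multi-code-point emoji (ZWJ sequences, skin tones) are rejected.
const PR_TITLE_PATTERN = new RegExp(
  "^\\p{Extended_Pictographic}\\uFE0F?\\(([a-z0-9-]+)\\): \\S.*$",
  "u",
);

function isValidPrTitle(title: string): boolean {
  return PR_TITLE_PATTERN.test(title);
}

console.log(isValidPrTitle("✨(supabase): Add staging remote configuration")); // true
console.log(isValidPrTitle(":sparkles:(supabase): Add staging remote configuration")); // false
console.log(isValidPrTitle("feat(scope): Your title")); // false
```

A check like this could run before `gh pr create`, but where (or whether) the project enforces the convention is not stated in the commit.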

.github/pull_request_template.md

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
+## Summary
+
+<!-- What does this PR do? Keep it concise but complete. -->
+
+## Motivation
+
+<!-- Why is this change needed? What problem does it solve? -->
+
+Closes #<!-- issue number -->
+
+## Type of Change
+
+- [ ] Bug fix (non-breaking change that fixes an issue)
+- [ ] New feature (non-breaking change that adds functionality)
+- [ ] Breaking change (fix or feature that would cause existing functionality to change)
+- [ ] Refactor (code change that neither fixes a bug nor adds a feature)
+- [ ] Documentation update
+- [ ] Test update
+- [ ] Chore (dependency updates, config changes, etc.)
+
+## Changes Made
+
+<!-- Bullet point the key changes in this PR -->
+
+-
+
+## Testing
+
+<!-- How did you test these changes? Include steps to reproduce if applicable. -->
+
+- [ ] Tested locally
+- [ ] Added/updated tests
+- [ ] Verified existing tests pass
+
+## Screenshots
+
+<!-- If this PR includes UI changes, add screenshots or screen recordings here. Delete this section if not applicable. -->
+
+## Checklist
+
+- [ ] My code follows the project's coding standards
+- [ ] `pnpm check` passes (lint + type check)
+- [ ] I have added tests that prove my fix/feature works
+- [ ] New and existing tests pass locally
+- [ ] I have updated documentation where needed
+- [ ] I have reviewed my own code before requesting review
+
+## Additional Notes
+
+<!-- Any other context, concerns, or things reviewers should know? Delete if not needed. -->
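
Whether a PR body still carries the headings from this template can be verified with a small script. The sketch below is illustrative only: `extractHeadings` and `missingSections` are hypothetical names, and treating some headings as optional is an assumption based on the template comments that allow deleting Screenshots and Additional Notes.

```typescript
// Hypothetical helpers (not part of this commit): report template headings
// that are absent from a PR body.
function extractHeadings(markdown: string): string[] {
  return markdown
    .split("\n")
    .filter((line) => line.startsWith("## "))
    .map((line) => line.trim());
}

function missingSections(
  template: string,
  body: string,
  optional: string[] = [],
): string[] {
  const present = new Set(extractHeadings(body));
  const skippable = new Set(optional);
  return extractHeadings(template).filter(
    (heading) => !present.has(heading) && !skippable.has(heading),
  );
}

// Shortened stand-ins for the real template and a PR body.
const templateText = ["## Summary", "", "## Motivation", "", "## Screenshots"].join("\n");
const bodyText = ["## Summary", "Adds a staging remote.", "", "## Testing"].join("\n");

// Only "## Motivation" is reported: Summary is present, Screenshots is optional.
console.log(missingSections(templateText, bodyText, ["## Screenshots"]));
```

The comparison is by exact heading text, so a renamed header is reported as missing.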

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -55,6 +55,7 @@ data/parliament/openparliament.public.sql
 scripts/.embedding-progress.db
 scripts/.embedding-progress.db-wal
 scripts/.embedding-progress.db-shm
+scripts/.leg-embedding-progress.db
 
 # AI Coding
 .plans

lib/ai/tools/retrieve-legislation-context.ts

Lines changed: 2 additions & 2 deletions
@@ -85,8 +85,8 @@ export async function getLegislationContext(
 
   dbg("search returned %d results", results.length);
 
-  // Build context with reranking
-  const context = buildLegislationContext(query, results, {
+  // Build context with reranking (now async with Cohere cross-encoder)
+  const context = await buildLegislationContext(query, results, {
     language: preferLang,
     topN: boundedLimit,
   });
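
This hunk only adds `await`; the body of `buildLegislationContext` and the Cohere call are not shown in the commit. As a rough illustration of why the caller changes, the sketch below uses stand-ins throughout: `scoreAgainstQuery` is a hypothetical term-overlap scorer in place of a remote cross-encoder, and `buildContextSketch` is an assumed shape, not the project's function.

```typescript
// Illustrative stand-ins only: the real buildLegislationContext and the
// Cohere request are not shown in this diff.
type SearchResult = { id: string; content: string };

// Hypothetical async scorer standing in for a remote cross-encoder call.
async function scoreAgainstQuery(query: string, text: string): Promise<number> {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  const haystack = text.toLowerCase();
  return terms.filter((term) => haystack.includes(term)).length;
}

// Scores every result, sorts by score (highest first), and keeps topN.
async function buildContextSketch(
  query: string,
  results: SearchResult[],
  options: { topN: number },
): Promise<SearchResult[]> {
  const scored = await Promise.all(
    results.map(async (result) => ({
      result,
      score: await scoreAgainstQuery(query, result.content),
    })),
  );
  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, options.topN)
    .map((entry) => entry.result);
}

async function main(): Promise<void> {
  const results: SearchResult[] = [
    { id: "a", content: "Definitions and interpretation" },
    { id: "b", content: "Accessibility barrier removal plans" },
  ];
  // Without `await`, `context` would be a pending Promise rather than an array.
  const context = await buildContextSketch("barrier removal", results, { topN: 1 });
  console.log(context.map((r) => r.id));
}

main();
```

Once any step inside the builder awaits a network call, every caller has to await it too, which is the change this hunk makes.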

lib/db/rag/schema.ts

Lines changed: 95 additions & 5 deletions
@@ -4,20 +4,42 @@
  * Contains all RAG-related tables for parliament and legislation embeddings.
  */
 
-import type { InferSelectModel } from "drizzle-orm";
+import { type InferSelectModel, sql } from "drizzle-orm";
 import {
+  check,
   customType,
   index,
   integer,
   jsonb,
   pgSchema,
   text,
   timestamp,
+  unique,
   varchar,
   vector,
 } from "drizzle-orm/pg-core";
 import { nanoid } from "nanoid";
 
+/**
+ * Valid source types for legislation resources
+ * Used for CHECK constraint and TypeScript type alignment
+ */
+export const LEG_SOURCE_TYPES = [
+  "act",
+  "act_section",
+  "regulation",
+  "regulation_section",
+  "defined_term",
+  "preamble",
+  "treaty",
+  "cross_reference",
+  "table_of_provisions",
+  "signature_block",
+  "related_provisions",
+] as const;
+
+export type LegSourceType = (typeof LEG_SOURCE_TYPES)[number];
+
 export const ragSchema = pgSchema("rag");
 
 /**
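
The `as const` array in this hunk serves two purposes: it derives the `LegSourceType` union and, later in the file, feeds the CHECK constraint. A minimal sketch of both uses follows, with a shortened copy of the list. `isLegSourceType` and `renderInList` are hypothetical helpers that do not appear in the commit.

```typescript
// Shortened copy of the list for illustration; the commit defines eleven values.
const LEG_SOURCE_TYPES = ["act", "act_section", "regulation", "defined_term"] as const;
type LegSourceType = (typeof LEG_SOURCE_TYPES)[number];

// Hypothetical type guard (not in the commit): narrows an arbitrary string,
// for example a value read from JSON, to LegSourceType.
function isLegSourceType(value: string): value is LegSourceType {
  return LEG_SOURCE_TYPES.some((allowed) => allowed === value);
}

// Renders the array as the value list of a SQL IN (...) clause, mirroring the
// map/join expression used inside the CHECK constraint.
function renderInList(values: readonly string[]): string {
  return values.map((v) => `'${v}'`).join(", ");
}

console.log(isLegSourceType("act_section")); // true
console.log(isLegSourceType("bill")); // false
console.log(renderInList(LEG_SOURCE_TYPES));
// 'act', 'act_section', 'regulation', 'defined_term'
```

Interpolating the values into raw SQL is only safe here because they are compile-time constants; the same approach would not be appropriate for user-supplied strings.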
@@ -182,8 +204,8 @@ export type ParlEmbedding = InferSelectModel<typeof parlEmbeddings>;
  * Fields needed for search filtering and citation building.
  */
 export type LegResourceMetadata = {
-  // Identity - source types for acts, regulations, and their sections
-  sourceType: "act" | "act_section" | "regulation" | "regulation_section";
+  // Identity - source types for all legislation content
+  sourceType: LegSourceType;
   language: "en" | "fr";
   chunkIndex?: number; // 0 for metadata chunk, 1+ for content chunks
 
@@ -196,6 +218,33 @@ export type LegResourceMetadata = {
   sectionId?: string; // FK to legislation.sections.id
   sectionLabel?: string; // e.g., "91", "Schedule I"
   marginalNote?: string; // Short description of section
+  sectionStatus?: string; // "in-force", "repealed", "not-in-force", etc.
+  sectionType?: string; // "section", "schedule", "preamble", "heading", etc.
+  hierarchyPath?: string[]; // e.g., ["Part I", "Division 1", "Subdivision A"]
+  contentFlags?: {
+    // Mirrors ContentFlags from legislation schema
+    hasTable?: boolean;
+    hasFormula?: boolean;
+    hasImage?: boolean;
+    imageSources?: string[];
+    hasRepealed?: boolean;
+  };
+  sectionInForceDate?: string; // ISO date when section came into force
+  historicalNotes?: {
+    // Mirrors HistoricalNoteItem from legislation schema
+    text: string;
+    type?: string;
+    enactedDate?: string;
+    inForceStartDate?: string;
+    enactId?: string;
+  }[];
+
+  // Defined term specific fields
+  termId?: string; // FK to legislation.defined_terms.id
+  term?: string; // The defined term itself (e.g., "barrier", "obstacle")
+  termPaired?: string; // The paired term in other language
+  scopeType?: string; // "act", "regulation", "part", "section"
+  scopeSections?: string[]; // Section scope if applicable
 
   // Act metadata fields
   longTitle?: string;
@@ -211,6 +260,34 @@ export type LegResourceMetadata = {
   enablingActId?: string;
   enablingActTitle?: string;
   registrationDate?: string;
+
+  // Preamble-specific fields
+  preambleIndex?: number; // Position in preamble array
+
+  // Treaty-specific fields
+  treatyTitle?: string; // Title of the treaty/convention
+
+  // Cross-reference fields
+  crossRefId?: string; // FK to legislation.cross_references.id
+  targetType?: string; // "act" or "regulation"
+  targetRef?: string; // Reference to target document
+  targetSectionRef?: string; // Optional section reference
+  referenceText?: string; // Display text for the reference
+
+  // Table of provisions fields
+  provisionLabel?: string; // Label from table of provisions
+  provisionTitle?: string; // Title from table of provisions
+  provisionLevel?: number; // Hierarchy level
+
+  // Signature block fields
+  signatureName?: string; // Name of signatory
+  signatureTitle?: string; // Title of signatory
+  signatureDate?: string; // Date of signature
+
+  // Related provisions fields
+  relatedProvisionLabel?: string; // Label from related provision (e.g., "Transitional Provisions")
+  relatedProvisionSource?: string; // Source reference
+  relatedProvisionSections?: string[]; // Referenced section numbers
 };
 
 /**
@@ -224,16 +301,29 @@ export const legResources = ragSchema.table(
     id: varchar("id", { length: 191 })
       .primaryKey()
       .$defaultFn(() => nanoid()),
-    sectionId: varchar("section_id", { length: 191 }).notNull(),
+    // Unique resource key for deduplication: "{sourceType}:{sourceId}:{language}:{chunkIndex}"
+    resourceKey: varchar("resource_key", { length: 255 }).notNull(),
     content: text("content").notNull(),
     metadata: jsonb("metadata").$type<LegResourceMetadata>().notNull(),
+    // Denormalized columns for fast filtering (avoids JSONB extraction in queries)
+    language: varchar("language", { length: 2 }).notNull(),
+    sourceType: varchar("source_type", { length: 30 }).notNull(),
     createdAt: timestamp("created_at").defaultNow().notNull(),
     updatedAt: timestamp("updated_at").defaultNow().notNull(),
   },
   (table) => [
-    index("leg_resources_section_id_idx").on(table.sectionId),
+    // Unique constraint to prevent duplicates on concurrent runs or restarts
+    unique("leg_resources_resource_key_unique").on(table.resourceKey),
+    index("leg_resources_resource_key_idx").on(table.resourceKey),
+    // Composite index for common filtering patterns (language + sourceType)
+    index("leg_resources_lang_source_idx").on(table.language, table.sourceType),
     // Single GIN index on metadata for flexible querying
     index("leg_resources_metadata_gin").using("gin", table.metadata),
+    // CHECK constraint for valid source types (data integrity)
+    check(
+      "leg_resources_source_type_check",
+      sql`${table.sourceType} IN (${sql.raw(LEG_SOURCE_TYPES.map((t) => `'${t}'`).join(", "))})`
+    ),
   ]
 );
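
The comment on `resourceKey` gives the key format but the code that builds it is not part of this hunk. A minimal sketch under that format follows; `buildResourceKey`, `ResourceKeyParts`, and the length guard are assumptions for illustration, not the project's implementation.

```typescript
// Hypothetical helper (not shown in this diff): builds a key in the
// "{sourceType}:{sourceId}:{language}:{chunkIndex}" shape from the column comment.
type ResourceKeyParts = {
  sourceType: string;
  sourceId: string;
  language: "en" | "fr";
  chunkIndex: number;
};

// Matches varchar("resource_key", { length: 255 }) in the table definition.
const RESOURCE_KEY_MAX_LENGTH = 255;

function buildResourceKey(parts: ResourceKeyParts): string {
  const key = [parts.sourceType, parts.sourceId, parts.language, parts.chunkIndex].join(":");
  if (key.length > RESOURCE_KEY_MAX_LENGTH) {
    throw new Error(`resource key exceeds ${RESOURCE_KEY_MAX_LENGTH} characters`);
  }
  return key;
}

const parts: ResourceKeyParts = {
  sourceType: "act_section",
  sourceId: "sec_123", // made-up identifier for the example
  language: "en",
  chunkIndex: 1,
};

console.log(buildResourceKey(parts)); // act_section:sec_123:en:1
console.log(buildResourceKey(parts) === buildResourceKey({ ...parts })); // true
```

Because the key is deterministic, a restarted or concurrent run produces the same value for the same chunk, and the unique constraint rejects the duplicate row. PostgreSQL already backs a unique constraint with its own index, so the separate `leg_resources_resource_key_idx` may be redundant.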