Commit 4c7834f

add an overview readme to introduce the system
1 parent 0fcf2ea commit 4c7834f

2 files changed

+105
-57
lines changed

sde_collections/models/README.md

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
1+
# URL Pattern Management System
2+
3+
## Overview
4+
This system provides a framework for managing and curating collections of URLs through pattern-based rules. It enables systematic modification, categorization, and filtering of URLs while maintaining a clear separation between work-in-progress changes and production content.
5+
6+
## Core Concepts
7+
8+
### URL States
9+
Content progresses through three states:
10+
- **Dump URLs**: Raw content from initial scraping/indexing
11+
- **Delta URLs**: Work-in-progress changes and modifications
12+
- **Curated URLs**: Production-ready, approved content
13+
14+
### Pattern Types
15+
- **Include/Exclude Patterns**: Control which URLs are included in collections
16+
- Include patterns always override exclude patterns
17+
- Use wildcards for matching multiple URLs
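
The include-over-exclude rule can be illustrated with a minimal sketch. It assumes shell-style wildcards via Python's `fnmatch` as a stand-in for the system's own matcher, and it assumes that a URL matching no pattern at all stays in the collection. The function name `is_included` and the sample URLs are hypothetical.

```python
from fnmatch import fnmatchcase


def is_included(url, include_patterns, exclude_patterns):
    """Return True if the URL should stay in the collection.

    Assumption: include patterns are checked first, so a matching
    include pattern wins even when an exclude pattern also matches.
    Assumption: URLs that match no pattern are kept by default.
    """
    if any(fnmatchcase(url, pattern) for pattern in include_patterns):
        return True
    return not any(fnmatchcase(url, pattern) for pattern in exclude_patterns)


# Hypothetical patterns: exclude all docs, but re-include the API docs.
exclude = ["*/docs/*"]
include = ["*/docs/api/*"]

api_url = "https://example.com/docs/api/users"
intro_url = "https://example.com/docs/intro"
blog_url = "https://example.com/blog/post"

print(is_included(api_url, include, exclude))
print(is_included(intro_url, include, exclude))
print(is_included(blog_url, include, exclude))
```

Checking include patterns before exclude patterns keeps the override rule in one place, so the order in which patterns were created does not matter in this sketch.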
18+
19+
- **Modification Patterns**: Change URL properties
20+
- Title patterns modify final titles shown in search results
21+
- Document type patterns affect which tab the URL appears under
22+
- Division patterns assign URLs to a division within the Science Knowledge Sources
23+
24+
### Pattern Resolution
25+
The system uses a "smallest set priority" strategy, which resolves conflicts by always applying the most specific pattern that matches a URL:
26+
- Multiple patterns can match the same URL
27+
- The pattern matching the fewest URLs takes precedence
28+
- Applies to title, division, and document type patterns
29+
- More specific patterns naturally override general ones
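
As a rough illustration of smallest set priority, the sketch below counts how many URLs each matching pattern covers and picks the pattern with the lowest count. It is not the project's implementation: `fnmatch` stands in for the real wildcard matcher, the URL list is made up, and the names `match_count` and `resolve` are hypothetical. Tie-breaking is not described here, so the sketch simply keeps the first pattern with the lowest count.

```python
from fnmatch import fnmatchcase


def match_count(pattern, urls):
    """Count how many URLs in the collection the pattern matches."""
    return sum(1 for url in urls if fnmatchcase(url, pattern))


def resolve(url, patterns, urls):
    """Return the most specific pattern matching `url`, or None.

    "Most specific" means the matching pattern that covers the fewest
    URLs in the whole collection (smallest set priority).
    """
    matching = [p for p in patterns if fnmatchcase(url, p)]
    if not matching:
        return None
    return min(matching, key=lambda p: match_count(p, urls))


# Hypothetical collection and patterns.
urls = [
    "https://example.com/docs/intro",
    "https://example.com/docs/setup",
    "https://example.com/docs/api/users",
    "https://example.com/blog/post",
]
patterns = ["*/docs/*", "*/docs/api/*"]

print(resolve("https://example.com/docs/api/users", patterns, urls))
print(resolve("https://example.com/docs/intro", patterns, urls))
```

Because the winner depends on counts over the whole collection, the result for a given URL can change as URLs are added or removed.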
30+
31+
## Getting Started
32+
33+
To effectively understand this system, we recommend reading through the documentation in the following order:
34+
35+
1. Begin with the Pattern System Overview to learn the fundamental concepts of how patterns work and interact with URLs
36+
2. Next, explore the URL Lifecycle documentation to understand how content moves through different states
37+
3. The Pattern Resolution documentation will show you how the system handles overlapping patterns
38+
4. Learn how to control which URLs appear in your collection with the Include/Exclude patterns guide
39+
5. Finally, review the Pattern Unapplication Logic to understand how pattern removal affects your URLs
40+
41+
Each section builds upon knowledge from previous sections, providing a comprehensive understanding of the system.
42+
43+
## Documentation
44+
45+
[Pattern System Overview](./README_PATTERN_SYSTEM.md)
46+
- Core concepts and pattern types
47+
- Pattern lifecycle and effects
48+
- Delta URL generation rules
49+
- Working principles (idempotency, separation of concerns)
50+
- Pattern interaction examples
51+
52+
[URL Lifecycle Management](./README_LIFECYCLE.md)
53+
- Migration process (Dump → Delta)
54+
- Promotion process (Delta → Curated)
55+
- Field handling during transitions
56+
- Pattern application timing
57+
- Data integrity considerations
58+
59+
[Pattern Resolution](./README_PATTERN_RESOLUTION.md)
60+
- Smallest set priority mechanism
61+
- URL counting and precedence
62+
- Performance considerations
63+
- Edge case handling
64+
- Implementation details
65+
66+
[URL Inclusion/Exclusion](./README_INCLUSION.md)
67+
- Wildcard pattern matching
68+
- Include/exclude precedence
69+
- Example pattern configurations
70+
- Best practices
71+
- Common pitfalls and solutions
72+
73+
[Pattern Unapplication Logic](./README_UNAPPLY_LOGIC.md)
74+
- Pattern removal handling
75+
- Delta management during unapplication
76+
- Manual change preservation
77+
- Cleanup procedures
78+
- Edge case handling
Lines changed: 27 additions & 57 deletions
@@ -1,32 +1,16 @@
1-
# URL Pattern Application Strategies
1+
# Pattern Resolution System
22

3-
## Strategy 1: Exclusive Patterns
3+
## Overview
4+
The pattern system uses a "smallest set priority" strategy for resolving conflicts between overlapping patterns. This applies to title patterns, division patterns, and document type patterns. The pattern that matches the smallest set of URLs takes precedence.
45

5-
Patterns have exclusive ownership of URLs they match. System prevents creation of overlapping patterns.
6+
## How It Works
67

7-
Example:
8-
```
9-
Pattern A: */docs/* # Matches 100 URLs
10-
Pattern B: */docs/api/* # Rejected - overlaps with Pattern A
11-
Pattern C: */blog/* # Accepted - no overlap
12-
```
13-
14-
Benefits:
15-
- Clear ownership
16-
- Predictable effects
17-
- Simple conflict resolution
18-
- Easy to debug
19-
20-
Drawbacks:
21-
- Less flexible
22-
- May require many specific patterns
23-
- May need pattern deletion/recreation to modify rules
24-
25-
## Strategy 2: Smallest Set Priority
8+
When multiple patterns match a URL, the system:
9+
1. Counts how many total URLs each pattern matches
10+
2. Compares the counts
11+
3. Applies the pattern that matches the fewest URLs
2612

27-
Multiple patterns can match same URLs. Pattern affecting smallest URL set takes precedence.
28-
29-
Example:
13+
### Example
3014
```
3115
Pattern A: */docs/* # Matches 100 URLs
3216
Pattern B: */docs/api/* # Matches 20 URLs
@@ -37,42 +21,28 @@ For URL "/docs/api/v2/users":
3721
- Pattern C wins (5 URLs < 20 URLs < 100 URLs)
3822
```
3923

40-
Benefits:
41-
- More flexible rule creation
42-
- Natural handling of specificity
43-
44-
Drawbacks:
45-
- Complex precedence rules
46-
- Pattern effects can change as URL sets grow
47-
- Harder to predict/debug
48-
- Performance impact from URL set size calculations
49-
50-
## Implementation Notes
24+
## Pattern Types and Resolution
5125

52-
Strategy 1:
26+
### Title Patterns
5327
```python
54-
def save(self, *args, **kwargs):
55-
# Check for overlapping patterns
56-
overlapping = self.get_matching_delta_urls().filter(
57-
deltapatterns__isnull=False
58-
).exists()
59-
if overlapping:
60-
raise ValidationError("Pattern would overlap existing pattern")
61-
super().save(*args, **kwargs)
28+
# More specific title pattern takes precedence
29+
Pattern A: */docs/* → title="Documentation" # 100 URLs
30+
Pattern B: */docs/api/* → title="API Reference" # 20 URLs
31+
Result: URL gets title "API Reference"
6232
```
6333

64-
Strategy 2:
34+
### Division Patterns
6535
```python
66-
def apply(self):
67-
matching_urls = self.get_matching_delta_urls()
68-
my_url_count = matching_urls.count()
69-
70-
# Only apply if this pattern matches fewer URLs than other matching patterns
71-
for url in matching_urls:
72-
other_patterns_min_count = url.deltapatterns.annotate(
73-
url_count=Count('delta_urls')
74-
).aggregate(Min('url_count'))['url_count__min'] or float('inf')
36+
# More specific division assignment wins
37+
Pattern A: *.pdf → division="GENERAL" # 500 URLs
38+
Pattern B: */specs/*.pdf → division="ENGINEERING" # 50 URLs
39+
Result: URL gets division "ENGINEERING"
40+
```
7541

76-
if my_url_count <= other_patterns_min_count:
77-
self.apply_to_url(url)
42+
### Document Type Patterns
43+
```python
44+
# Most specific document type classification applies
45+
Pattern A: */docs/* → type="DOCUMENTATION" # 200 URLs
46+
Pattern B: */docs/data/* → type="DATA" # 30 URLs
47+
Result: URL gets type "DATA"
7848
```
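
The three examples above follow the same rule, which the sketch below applies using the match counts quoted in them. It assumes that each pattern type is resolved independently of the others and that both patterns in each example match the URL in question. The `Candidate` class, the field names, and `pick_most_specific` are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A pattern that matches the URL, with the value it would assign."""
    pattern: str
    value: str
    match_count: int  # total URLs the pattern matches in the collection


def pick_most_specific(candidates):
    """Return the candidate whose pattern matches the fewest URLs."""
    return min(candidates, key=lambda c: c.match_count)


# Counts and values copied from the three examples above.
candidates_by_field = {
    "title": [
        Candidate("*/docs/*", "Documentation", 100),
        Candidate("*/docs/api/*", "API Reference", 20),
    ],
    "division": [
        Candidate("*.pdf", "GENERAL", 500),
        Candidate("*/specs/*.pdf", "ENGINEERING", 50),
    ],
    "document_type": [
        Candidate("*/docs/*", "DOCUMENTATION", 200),
        Candidate("*/docs/data/*", "DATA", 30),
    ],
}

# Assumption: each field is resolved on its own, so the winning title
# pattern has no influence on the division or document type.
resolved = {
    field: pick_most_specific(candidates).value
    for field, candidates in candidates_by_field.items()
}
print(resolved)
```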
