This system provides a framework for managing and curating collections of URLs through pattern-based rules. It enables systematic modification, categorization, and filtering of URLs while maintaining a clear separation between work-in-progress changes and production content.
+
+## Core Concepts
+
+### URL States
+Content progresses through three states:
+- **Dump URLs**: Raw content from initial scraping/indexing
+- **Delta URLs**: Work-in-progress changes and modifications
The pattern system uses a "smallest set priority" strategy for resolving conflicts between overlapping patterns. This applies to title patterns, division patterns, and document type patterns. The pattern that matches the smallest set of URLs takes precedence.

-Patterns have exclusive ownership of URLs they match. System prevents creation of overlapping patterns.
+## How It Works

-Example:
-```
-Pattern A: */docs/* # Matches 100 URLs
-Pattern B: */docs/api/* # Rejected - overlaps with Pattern A
-Pattern C: */blog/* # Accepted - no overlap
-```
-
-Benefits:
-- Clear ownership
-- Predictable effects
-- Simple conflict resolution
-- Easy to debug
-
-Drawbacks:
-- Less flexible
-- May require many specific patterns
-- May need pattern deletion/recreation to modify rules
-
-## Strategy 2: Smallest Set Priority
+When multiple patterns match a URL, the system:
+1. Counts how many total URLs each pattern matches
+2. Compares the counts
+3. Applies the pattern that matches the fewest URLs

-Multiple patterns can match same URLs. Pattern affecting smallest URL set takes precedence.
-
-Example:
+### Example
 ```
 Pattern A: */docs/* # Matches 100 URLs
 Pattern B: */docs/api/* # Matches 20 URLs
@@ -37,42 +21,28 @@ For URL "/docs/api/v2/users":
 - Pattern C wins (5 URLs < 20 URLs < 100 URLs)
 ```

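The resolution steps added in this change (count, compare, apply the smallest set) can be sketched in Python. This is only an illustration under stated assumptions, not the project's implementation: the function name `resolve_pattern`, the use of `fnmatch` for glob matching, and the sample URLs are all hypothetical. Pattern C's definition is not visible in this excerpt, so `*/docs/api/v2/*` stands in for it, and the collection is scaled down from the 100/20/5 counts in the example.

```python
from fnmatch import fnmatchcase


def resolve_pattern(url, patterns, all_urls):
    """Return the matching pattern that covers the fewest URLs, or None."""
    # Step 1: count how many URLs in the whole collection each pattern matches.
    counts = {p: sum(fnmatchcase(u, p) for u in all_urls) for p in patterns}
    # Only patterns that match the URL being resolved are candidates.
    candidates = [p for p in patterns if fnmatchcase(url, p)]
    if not candidates:
        return None
    # Steps 2 and 3: compare the counts and apply the smallest set.
    # On a tie, min() keeps the pattern listed first.
    return min(candidates, key=lambda p: counts[p])


# Hypothetical collection; every URL and pattern here is made up.
urls = [
    "/docs/intro",
    "/docs/guide",
    "/docs/api/v1/users",
    "/docs/api/v2/users",
    "/docs/api/v2/teams",
]
patterns = ["*/docs/*", "*/docs/api/*", "*/docs/api/v2/*"]

winner = resolve_pattern("/docs/api/v2/users", patterns, urls)
print(winner)
```

In this sketch the three patterns match 5, 3, and 2 URLs respectively, so the narrowest one is selected for `/docs/api/v2/users`, mirroring the "Pattern C wins" line above. How the real system breaks ties is not stated in the diff; first-listed is an arbitrary choice here.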
-Benefits:
-- More flexible rule creation
-- Natural handling of specificity
-
-Drawbacks:
-- Complex precedence rules
-- Pattern effects can change as URL sets grow
-- Harder to predict/debug
-- Performance impact from URL set size calculations
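The removed drawback "Pattern effects can change as URL sets grow" is easy to reproduce in a small Python sketch. Everything below is an assumption made for illustration: the helper `winning_pattern`, the two patterns (chosen to overlap without one containing the other), the URLs, and `fnmatch`-style glob matching.

```python
from fnmatch import fnmatchcase


def winning_pattern(url, patterns, all_urls):
    """Pick the matching pattern with the fewest total matches (ties: first listed)."""
    matching = [p for p in patterns if fnmatchcase(url, p)]
    return min(matching, key=lambda p: sum(fnmatchcase(u, p) for u in all_urls))


# Hypothetical patterns that overlap on the target URL but are not nested.
patterns = ["*/api/*", "*.html"]
target = "/api/index.html"

# Initially "*/api/*" matches 3 URLs and "*.html" matches 2.
urls = [target, "/api/users", "/api/teams", "/about.html"]
before = winning_pattern(target, patterns, urls)

# More pages are indexed; no pattern is edited.
urls += ["/faq.html", "/contact.html", "/terms.html"]
# Now "*.html" matches 5 URLs while "*/api/*" still matches 3.
after = winning_pattern(target, patterns, urls)

print(before, "->", after)
```

The pattern applied to the same URL flips from `*.html` to `*/api/*` purely because the collection grew. Nested patterns such as `*/docs/*` and `*/docs/api/*` cannot flip this way, since the narrower set is always a subset of the wider one.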