Commit 99cfb9f

Merge pull request #1134 from NASA-IMPACT/1133-refactor-indexing-statuses-logic
1133 refactor indexing statuses logic
2 parents febee1b + 85574aa commit 99cfb9f

14 files changed, 782 additions and 148 deletions
Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
# Generated by Django 4.2.9 on 2024-12-10 19:18

from django.db import migrations, models


class Migration(migrations.Migration):

    dependencies = [
        ("sde_collections", "0072_collection_reindexing_status_reindexinghistory"),
    ]

    operations = [
        migrations.AlterField(
            model_name="collection",
            name="workflow_status",
            field=models.IntegerField(
                choices=[
                    (1, "Research in Progress"),
                    (2, "Ready for Engineering"),
                    (3, "Engineering in Progress"),
                    (4, "Ready for Curation"),
                    (5, "Curation in Progress"),
                    (6, "Curated"),
                    (7, "Quality Fixed"),
                    (8, "Secret Deployment Started"),
                    (9, "Secret Deployment Failed"),
                    (10, "Ready for LRM Quality Check"),
                    (11, "Ready for Quality Check"),
                    (12, "QC: Failed"),
                    (18, "QC: Minor Issues"),
                    (13, "QC: Perfect"),
                    (14, "Prod: Perfect"),
                    (15, "Prod: Minor Issues"),
                    (16, "Prod: Major Issues"),
                    (17, "Code Merge Pending"),
                    (19, "Delete from Prod"),
                    (20, "Indexing Finished on LRM Dev"),
                ],
                default=1,
            ),
        ),
        migrations.AlterField(
            model_name="workflowhistory",
            name="old_status",
            field=models.IntegerField(
                choices=[
                    (1, "Research in Progress"),
                    (2, "Ready for Engineering"),
                    (3, "Engineering in Progress"),
                    (4, "Ready for Curation"),
                    (5, "Curation in Progress"),
                    (6, "Curated"),
                    (7, "Quality Fixed"),
                    (8, "Secret Deployment Started"),
                    (9, "Secret Deployment Failed"),
                    (10, "Ready for LRM Quality Check"),
                    (11, "Ready for Quality Check"),
                    (12, "QC: Failed"),
                    (18, "QC: Minor Issues"),
                    (13, "QC: Perfect"),
                    (14, "Prod: Perfect"),
                    (15, "Prod: Minor Issues"),
                    (16, "Prod: Major Issues"),
                    (17, "Code Merge Pending"),
                    (19, "Delete from Prod"),
                    (20, "Indexing Finished on LRM Dev"),
                ],
                null=True,
            ),
        ),
        migrations.AlterField(
            model_name="workflowhistory",
            name="workflow_status",
            field=models.IntegerField(
                choices=[
                    (1, "Research in Progress"),
                    (2, "Ready for Engineering"),
                    (3, "Engineering in Progress"),
                    (4, "Ready for Curation"),
                    (5, "Curation in Progress"),
                    (6, "Curated"),
                    (7, "Quality Fixed"),
                    (8, "Secret Deployment Started"),
                    (9, "Secret Deployment Failed"),
                    (10, "Ready for LRM Quality Check"),
                    (11, "Ready for Quality Check"),
                    (12, "QC: Failed"),
                    (18, "QC: Minor Issues"),
                    (13, "QC: Perfect"),
                    (14, "Prod: Perfect"),
                    (15, "Prod: Minor Issues"),
                    (16, "Prod: Major Issues"),
                    (17, "Code Merge Pending"),
                    (19, "Delete from Prod"),
                    (20, "Indexing Finished on LRM Dev"),
                ],
                default=1,
            ),
        ),
    ]
Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
# Generated by Django 4.2.9 on 2024-12-11 02:41

from django.db import migrations, models


class Migration(migrations.Migration):

    dependencies = [
        ("sde_collections", "0073_alter_collection_workflow_status_and_more"),
    ]

    operations = [
        migrations.AlterField(
            model_name="collection",
            name="reindexing_status",
            field=models.IntegerField(
                choices=[
                    (1, "Re-Indexing Not Needed"),
                    (2, "Re-Indexing Needed"),
                    (3, "Re-Indexing Finished"),
                    (4, "Ready for Re-Curation"),
                    (5, "Re-Curation Finished"),
                    (6, "Re-Indexed on Prod"),
                ],
                default=1,
                verbose_name="Reindexing Status",
            ),
        ),
        migrations.AlterField(
            model_name="reindexinghistory",
            name="old_status",
            field=models.IntegerField(
                choices=[
                    (1, "Re-Indexing Not Needed"),
                    (2, "Re-Indexing Needed"),
                    (3, "Re-Indexing Finished"),
                    (4, "Ready for Re-Curation"),
                    (5, "Re-Curation Finished"),
                    (6, "Re-Indexed on Prod"),
                ],
                null=True,
            ),
        ),
        migrations.AlterField(
            model_name="reindexinghistory",
            name="reindexing_status",
            field=models.IntegerField(
                choices=[
                    (1, "Re-Indexing Not Needed"),
                    (2, "Re-Indexing Needed"),
                    (3, "Re-Indexing Finished"),
                    (4, "Ready for Re-Curation"),
                    (5, "Re-Curation Finished"),
                    (6, "Re-Indexed on Prod"),
                ],
                default=1,
            ),
        ),
    ]

sde_collections/models/README.md

Lines changed: 10 additions & 73 deletions
@@ -1,78 +1,15 @@
-# URL Pattern Management System
+# COSMOS Curation System

-## Overview
-This system provides a framework for managing and curating collections of URLs through pattern-based rules. It enables systematic modification, categorization, and filtering of URLs while maintaining a clear separation between work-in-progress changes and production content.
-
-## Core Concepts
-
-### URL States
-Content progresses through three states:
-- **Dump URLs**: Raw content from initial scraping/indexing
-- **Delta URLs**: Work-in-progress changes and modifications
-- **Curated URLs**: Production-ready, approved content
-
-### Pattern Types
-- **Include/Exclude Patterns**: Control which URLs are included in collections
-  - Include patterns always override exclude patterns
-  - Use wildcards for matching multiple URLs
-
-- **Modification Patterns**: Change URL properties
-  - Title patterns modify final titles shown in search results
-  - Document type patterns affect which tab the URL appears under
-  - Division patterns assign URLs within the Science Knowledge Sources
-
-### Pattern Resolution
-The system uses a "smallest set priority" strategy which resolves conflicts by always using the most specific pattern that matches a URL:
-- Multiple patterns can match the same URL
-- Pattern matching the smallest number of URLs takes precedence
-- Applies to title, division, and document type patterns
-- More specific patterns naturally override general ones
-
-## Getting Started
-
-To effectively understand this system, we recommend reading through the documentation in the following order:
-
-1. Begin with the Pattern System Overview to learn the fundamental concepts of how patterns work and interact with URLs
-2. Next, explore the URL Lifecycle documentation to understand how content moves through different states
-3. The Pattern Resolution documentation will show you how the system handles overlapping patterns
-4. Learn how to control which URLs appear in your collection with the Include/Exclude patterns guide
-5. Finally, review the Pattern Unapplication Logic to understand how pattern removal affects your URLs
-
-Each section builds upon knowledge from previous sections, providing a comprehensive understanding of the system.
+A system for managing collections of URLs through pattern-based rules and status workflows.

 ## Documentation

-[Pattern System Overview](./README_PATTERN_SYSTEM.md)
-- Core concepts and pattern types
-- Pattern lifecycle and effects
-- Delta URL generation rules
-- Working principles (idempotency, separation of concerns)
-- Pattern interaction examples
-
-[URL Lifecycle Management](./README_LIFECYCLE.md)
-- Migration process (Dump → Delta)
-- Promotion process (Delta → Curated)
-- Field handling during transitions
-- Pattern application timing
-- Data integrity considerations
-
-[Pattern Resolution](./README_PATTERN_RESOLUTION.md)
-- Smallest set priority mechanism
-- URL counting and precedence
-- Performance considerations
-- Edge case handling
-- Implementation details
-
-[URL Inclusion/Exclusion](./README_INCLUSION.md)
-- Wildcard pattern matching
-- Include/exclude precedence
-- Example pattern configurations
-- Best practices
-- Common pitfalls and solutions

-[Pattern Unapplication Logic](./README_UNAPPLY_LOGIC.md)
-- Pattern removal handling
-- Delta management during unapplication
-- Manual change preservation
-- Cleanup procedures
-- Edge case handling
+- [URL Pattern Overview](./README_PATTERN_OVERVIEW.md) - Core pattern system for URL filtering and modification
+  - [Pattern System Details](./README_PATTERN_SYSTEM.md)
+  - [URL Lifecycle Management](./README_LIFECYCLE.md)
+  - [Pattern Resolution](./README_PATTERN_RESOLUTION.md)
+  - [URL Inclusion/Exclusion](./README_INCLUSION.md)
+  - [Pattern Unapplication Logic](./README_UNAPPLY_LOGIC.md)
+- [Collection Status Workflows](./README_STATUS_TRIGGERS.md) - Collection progression and automated triggers
+- [Reindexing Status System](./README_REINDEXING_STATUSES.md) - Status management for reindexing collections
Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
# URL Pattern Management System

## Overview
This system provides a framework for managing and curating collections of URLs through pattern-based rules. It enables systematic modification, categorization, and filtering of URLs while maintaining a clear separation between work-in-progress changes and production content.

## Core Concepts

### URL States
Content progresses through three states:
- **Dump URLs**: Raw content from initial scraping/indexing
- **Delta URLs**: Work-in-progress changes and modifications
- **Curated URLs**: Production-ready, approved content

### Pattern Types
- **Include/Exclude Patterns**: Control which URLs are included in collections
  - Include patterns always override exclude patterns
  - Use wildcards for matching multiple URLs

- **Modification Patterns**: Change URL properties
  - Title patterns modify final titles shown in search results
  - Document type patterns affect which tab the URL appears under
  - Division patterns assign URLs within the Science Knowledge Sources
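
The include-over-exclude rule can be illustrated with a minimal sketch. It assumes shell-style wildcards handled by Python's standard `fnmatch` module; the function name `is_included` and the example URLs are hypothetical and are not taken from the project's code.

```python
from fnmatch import fnmatchcase


def is_included(url, include_patterns, exclude_patterns):
    """Return True if the URL should remain in the collection."""
    # An include match wins outright, even when an exclude pattern also matches.
    if any(fnmatchcase(url, pattern) for pattern in include_patterns):
        return True
    # Otherwise the URL is kept only if no exclude pattern matches it.
    return not any(fnmatchcase(url, pattern) for pattern in exclude_patterns)


# Hypothetical patterns: drop everything under /docs/ except /docs/keep/.
exclude = ["https://example.com/docs/*"]
include = ["https://example.com/docs/keep/*"]

kept = is_included("https://example.com/docs/keep/a.html", include, exclude)
dropped = is_included("https://example.com/docs/old.html", include, exclude)
untouched = is_included("https://example.com/about.html", include, exclude)
```

Note that `fnmatch` also treats `?` and `[` as wildcard characters, so URLs containing query strings would need escaping in a real matcher.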

### Pattern Resolution
The system uses a "smallest set priority" strategy which resolves conflicts by always using the most specific pattern that matches a URL:
- Multiple patterns can match the same URL
- Pattern matching the smallest number of URLs takes precedence
- Applies to title, division, and document type patterns
- More specific patterns naturally override general ones
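
As a rough illustration of "smallest set priority", the sketch below counts how many known URLs each matching pattern covers and picks the one with the lowest count. It assumes shell-style wildcards via the standard `fnmatch` module; the helper `most_specific_pattern`, the tie-breaking behaviour, and the sample data are all hypothetical.

```python
from fnmatch import fnmatchcase


def most_specific_pattern(url, patterns, all_urls):
    """Return the matching pattern that covers the fewest URLs, or None."""
    candidates = [p for p in patterns if fnmatchcase(url, p)]
    if not candidates:
        return None

    def match_count(pattern):
        # Specificity is measured by how many known URLs the pattern matches.
        return sum(fnmatchcase(u, pattern) for u in all_urls)

    # min() keeps the first candidate on a tie; tie-breaking is an assumption here.
    return min(candidates, key=match_count)


urls = [
    "https://example.com/a/1",
    "https://example.com/a/2",
    "https://example.com/b/1",
]
# Hypothetical title patterns keyed by their wildcard match string.
titles = {
    "https://example.com/*": "General",
    "https://example.com/a/*": "Section A",
}

winner = most_specific_pattern("https://example.com/a/1", titles, urls)
fallback = most_specific_pattern("https://example.com/b/1", titles, urls)
```

Here the `/a/*` pattern matches two URLs and the site-wide pattern matches three, so the narrower pattern supplies the title for `/a/1`, while `/b/1` falls back to the general one.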

## Getting Started

To effectively understand this system, we recommend reading through the documentation in the following order:

1. Begin with the Pattern System Overview to learn the fundamental concepts of how patterns work and interact with URLs
2. Next, explore the URL Lifecycle documentation to understand how content moves through different states
3. The Pattern Resolution documentation will show you how the system handles overlapping patterns
4. Learn how to control which URLs appear in your collection with the Include/Exclude patterns guide
5. Finally, review the Pattern Unapplication Logic to understand how pattern removal affects your URLs

Each section builds upon knowledge from previous sections, providing a comprehensive understanding of the system.

## Documentation

[Pattern System Overview](./README_PATTERN_SYSTEM.md)
- Core concepts and pattern types
- Pattern lifecycle and effects
- Delta URL generation rules
- Working principles (idempotency, separation of concerns)
- Pattern interaction examples

[URL Lifecycle Management](./README_LIFECYCLE.md)
- Migration process (Dump → Delta)
- Promotion process (Delta → Curated)
- Field handling during transitions
- Pattern application timing
- Data integrity considerations

[Pattern Resolution](./README_PATTERN_RESOLUTION.md)
- Smallest set priority mechanism
- URL counting and precedence
- Performance considerations
- Edge case handling
- Implementation details

[URL Inclusion/Exclusion](./README_INCLUSION.md)
- Wildcard pattern matching
- Include/exclude precedence
- Example pattern configurations
- Best practices
- Common pitfalls and solutions

[Pattern Unapplication Logic](./README_UNAPPLY_LOGIC.md)
- Pattern removal handling
- Delta management during unapplication
- Manual change preservation
- Cleanup procedures
- Edge case handling
Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
# Collection Status Workflows

This document outlines the automated workflows triggered by status changes in Collections.

## Workflow Status Transitions

Collections progress through workflow statuses that trigger specific automated actions:

### Initial Flow
1. `RESEARCH_IN_PROGRESS` → `READY_FOR_ENGINEERING`
   - Triggers: Creation of initial scraper and indexer configs

2. `READY_FOR_ENGINEERING` → `ENGINEERING_IN_PROGRESS` → `INDEXING_FINISHED_ON_DEV`
   - When indexing finishes, a developer changes the status to `INDEXING_FINISHED_ON_DEV`
   - This triggers a full text fetch from LRM dev
   - If the fetch completes successfully, the status is updated to `READY_FOR_CURATION`

3. `READY_FOR_CURATION`
   - Triggers creation/update of plugin config

4. `READY_FOR_CURATION` → `CURATION_IN_PROGRESS` → `CURATED`
   - When curation finishes, the curator marks the collection as `CURATED`
   - This triggers the promotion of DeltaUrls to CuratedUrls

5. Quality Check Flow:
   - During quality checks, the curator can set the status to `QUALITY_CHECK_PERFECT/MINOR`
   - These passing quality statuses trigger the addition of the collection to the public query
   - After the PR is merged and the SDE Prod server is updated with the latest code, the collection becomes visible
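
The triggers listed above can be summarised as a lookup from the new status to the action it starts. The sketch below is a simplified stand-in using the standard library rather than Django: the `WorkflowStatus` enum, the pairing of names with the integer values from the migration, the `TRIGGERS` table, and `action_for` are all assumptions made for illustration, not the project's signal handler.

```python
from enum import IntEnum


class WorkflowStatus(IntEnum):
    # Name-to-value pairing is inferred from the migration labels (an assumption).
    RESEARCH_IN_PROGRESS = 1
    READY_FOR_ENGINEERING = 2
    ENGINEERING_IN_PROGRESS = 3
    READY_FOR_CURATION = 4
    CURATION_IN_PROGRESS = 5
    CURATED = 6
    QUALITY_CHECK_PERFECT = 13
    QUALITY_CHECK_MINOR = 18
    INDEXING_FINISHED_ON_DEV = 20


# Action started when a collection enters each status (descriptions only).
TRIGGERS = {
    WorkflowStatus.READY_FOR_ENGINEERING: "create initial scraper and indexer configs",
    WorkflowStatus.INDEXING_FINISHED_ON_DEV: "fetch full text from LRM dev",
    WorkflowStatus.READY_FOR_CURATION: "create or update plugin config",
    WorkflowStatus.CURATED: "promote DeltaUrls to CuratedUrls",
    WorkflowStatus.QUALITY_CHECK_PERFECT: "add collection to public query",
    WorkflowStatus.QUALITY_CHECK_MINOR: "add collection to public query",
}


def action_for(old_status, new_status):
    """Return the action triggered by a status change, or None."""
    # Saving a collection without changing its status triggers nothing.
    if old_status == new_status:
        return None
    return TRIGGERS.get(new_status)


promotion = action_for(WorkflowStatus.CURATION_IN_PROGRESS, WorkflowStatus.CURATED)
```

Keying the table on the new status alone keeps the sketch short; a real handler may also need the old status to decide whether an action applies.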

### Reindexing Flow

After the main workflow, collections can enter a reindexing cycle:

1. `REINDEXING_NOT_NEEDED` → `REINDEXING_NEEDED_ON_DEV`
   - By default collections do not need reindexing
   - They can be manually marked as reindexing needed on dev

2. `REINDEXING_NEEDED_ON_DEV` → `REINDEXING_FINISHED_ON_DEV`
   - When re-indexing finishes, a developer changes the status to `REINDEXING_FINISHED_ON_DEV`
   - This will trigger a full text fetch from LRM dev
   - If the fetch completes successfully, it updates the status to `REINDEXING_READY_FOR_CURATION`

3. `REINDEXING_READY_FOR_CURATION` → `REINDEXING_CURATED`
   - When re-curation finishes, the curator marks the collection as `REINDEXING_CURATED`
   - This triggers the promotion of DeltaUrls to CuratedUrls

4. `REINDEXING_CURATED` → `REINDEXING_INDEXED_ON_PROD`
   - After the collection has been indexed on Prod, a dev marks it as `REINDEXING_INDEXED_ON_PROD`
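
The reindexing cycle reads naturally as a small transition table. The following sketch uses a standard-library enum in place of the Django choices class; the pairing of the status names with values 1 to 6 is inferred from the order of the migration labels, and `NEXT_STEP` and `follows_documented_flow` are hypothetical helpers. Nothing in the source says the application enforces this ordering.

```python
from enum import IntEnum


class ReindexingStatus(IntEnum):
    # Values follow the order of the migration choices (an assumption).
    REINDEXING_NOT_NEEDED = 1
    REINDEXING_NEEDED_ON_DEV = 2
    REINDEXING_FINISHED_ON_DEV = 3
    REINDEXING_READY_FOR_CURATION = 4
    REINDEXING_CURATED = 5
    REINDEXING_INDEXED_ON_PROD = 6


S = ReindexingStatus

# For each status: the next documented status and who or what sets it.
NEXT_STEP = {
    S.REINDEXING_NOT_NEEDED: (S.REINDEXING_NEEDED_ON_DEV, "manual"),
    S.REINDEXING_NEEDED_ON_DEV: (S.REINDEXING_FINISHED_ON_DEV, "developer"),
    S.REINDEXING_FINISHED_ON_DEV: (S.REINDEXING_READY_FOR_CURATION, "automatic"),
    S.REINDEXING_READY_FOR_CURATION: (S.REINDEXING_CURATED, "curator"),
    S.REINDEXING_CURATED: (S.REINDEXING_INDEXED_ON_PROD, "developer"),
}


def follows_documented_flow(old, new):
    """Return True if old -> new is one of the documented reindexing steps."""
    step = NEXT_STEP.get(old)
    return step is not None and step[0] == new


ok = follows_documented_flow(S.REINDEXING_NEEDED_ON_DEV, S.REINDEXING_FINISHED_ON_DEV)
skipped = follows_documented_flow(S.REINDEXING_NOT_NEEDED, S.REINDEXING_CURATED)
```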

## Full Text Import Process

The full text import process integrates with both workflows:

1. Clears existing DumpUrls for the collection
2. Fetches and processes new full text data in batches
3. Creates new DumpUrls
4. Migrates DumpUrls to DeltaUrls
5. Updates collection status based on context:
   - In main workflow: Updates to `READY_FOR_CURATION`
   - In reindexing: Updates to `REINDEXING_READY_FOR_CURATION`
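
The five steps can be traced with an in-memory sketch. It deliberately avoids Django and models a collection as a plain dict, so every name here (`import_full_text`, the dict keys, the `reindexing` flag) is hypothetical, and the Dump to Delta migration is reduced to a simple copy for illustration.

```python
def import_full_text(collection, batches, reindexing=False):
    """Walk through the documented import steps on a dict-based collection."""
    # Step 1: clear existing dump URLs for the collection.
    collection["dump_urls"] = []

    # Steps 2 and 3: process the fetched data batch by batch, creating dump URLs.
    for batch in batches:
        collection["dump_urls"].extend(batch)

    # Step 4: migrate dump URLs to delta URLs (simplified here to a copy).
    collection["delta_urls"] = list(collection["dump_urls"])

    # Step 5: update the status that matches the workflow in progress.
    if reindexing:
        collection["reindexing_status"] = "REINDEXING_READY_FOR_CURATION"
    else:
        collection["workflow_status"] = "READY_FOR_CURATION"
    return collection


collection = {
    "dump_urls": ["https://example.com/stale"],
    "delta_urls": [],
    "workflow_status": "INDEXING_FINISHED_ON_DEV",
    "reindexing_status": "REINDEXING_NOT_NEEDED",
}
batches = [["https://example.com/1", "https://example.com/2"], ["https://example.com/3"]]
result = import_full_text(collection, batches)
```

Passing `reindexing=True` leaves `workflow_status` untouched and updates only the reindexing status, mirroring the two contexts in step 5.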

## Key Models and Files

- `Collection`: Main model handling status transitions
- `WorkflowStatusChoices`: Enum defining main workflow states
- `ReindexingStatusChoices`: Enum defining reindexing states
- `tasks.py`: Contains full text import logic and status updates
- Signal handler in Collection model manages status change triggers
