Commit 7ae5fde

Merge branch '1232-process-the-full-text-dump' of github.com:NASA-IMPACT/COSMOS into 1232-process-the-full-text-dump
2 parents 5279857 + 2ff7dce commit 7ae5fde

79 files changed: +5644 −1057 lines changed


.github/workflows/run_full_test_suite.yml

Lines changed: 2 additions & 0 deletions

@@ -4,6 +4,8 @@ on:
   pull_request:
     branches:
       - dev
+    paths-ignore:
+      - '**/*.md'
 
 jobs:
   run-tests:

CHANGELOG.md

Lines changed: 159 additions & 0 deletions

@@ -12,6 +12,24 @@ For each PR made, an entry should be added to this changelog. It should contain
- etc.

## Changelog

- 1209-bug-fix-document-type-creator-form
  - Description: The dropdown on the pattern creation form should default to the multi-URL option, since the doc type creator form is mostly used for multi-URL pattern creation. This applies to doc types, division types, and titles as well.
  - Changes:
    - Set the default value of `match_pattern_type` in the `BaseMatchPattern` class to `2` (see the sketch after this entry)
    - Changed the `test_create_simple_exclude_pattern` test within `TestDeltaExcludePatternBasics`
    - Changed `test_create_division_pattern` and `test_create_document_type_pattern_single` within `TestFieldModifierPatternBasics`
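A minimal sketch of what the changed default could look like, assuming `BaseMatchPattern` is an abstract Django model whose pattern-type field uses integer choices (the choice names here are illustrative, not the project's actual ones):

```python
from django.db import models


class BaseMatchPattern(models.Model):
    """Abstract base for URL match patterns (sketch, not the actual COSMOS model)."""

    class MatchPatternTypeChoices(models.IntegerChoices):
        INDIVIDUAL_URL_PATTERN = 1, "Individual URL Pattern"
        MULTI_URL_PATTERN = 2, "Multi-URL Pattern"

    match_pattern = models.CharField(max_length=255)
    # Defaulting to 2 makes newly created patterns multi-URL unless the curator overrides it.
    match_pattern_type = models.IntegerField(
        choices=MatchPatternTypeChoices.choices,
        default=MatchPatternTypeChoices.MULTI_URL_PATTERN,
    )

    class Meta:
        abstract = True
```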

- 1052-update-cosmos-to-create-jobs-for-scrapers-and-indexers
  - Description: The original automation, which generated the scrapers and indexers automatically on a collection workflow status change, needed to be updated to reflect the curation workflow more accurately. Generating the jobs during this process also streamlines it.
  - Changes:
    - Updated function nomenclature. Scrapers are Sinequa connector configurations used to scrape all the URLs prior to curation. Indexers are Sinequa connector configurations used to scrape the URLs after curation and index the content on production. Jobs trigger the connectors and are included as parts of joblists.
    - Parameterized the `convert_template_to_job` method to take the `job_source`, streamlining the value added to the `<Collection>` tag in the job XML (see the sketch after this entry)
    - Updated the fields that are pertinent to transfer from a scraper to an indexer, and added a third level of XML processing to facilitate this
    - `scraper_template.xml` and `indexer_template.xml` now contain the templates used to generate the respective configurations
    - Deleted the redundant `webcrawler_initial_crawl.xml` file
    - Added and updated tests on workflow status triggers
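A hypothetical sketch of the parameterized helper, assuming the job template is an XML document containing a `<Collection>` element whose text should point at the scraper or indexer configuration (the signature and tag handling are assumptions, not the project's exact code):

```python
import xml.etree.ElementTree as ET


def convert_template_to_job(template_path: str, collection_name: str, job_source: str) -> str:
    """Render a job XML from the template, filling <Collection> with "<job_source>/<collection_name>"."""
    tree = ET.parse(template_path)
    root = tree.getroot()

    collection_tag = root.find("Collection")
    if collection_tag is None:
        raise ValueError("Job template is missing a <Collection> element")

    # job_source distinguishes scraper jobs from indexer jobs, e.g. "scrapers" vs "indexers".
    collection_tag.text = f"{job_source}/{collection_name}"
    return ET.tostring(root, encoding="unicode")
```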
- 2889-serialize-the-tdamm-tags
  - Description: Have TDAMM serialized in a specific way and exposed via the Curated URLs API to be consumed into SDE Test/Prod
  - Changes:

@@ -35,3 +53,144 @@ For each PR made, an entry should be added to this changelog. It should contain
    - Defined a class `HTMLFreeCharField` which inherits `serializers.CharField` (see the sketch below)
    - Used regex to catch any HTML content coming in as an input to form fields
    - Called this class within the serializer for the necessary fields
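A minimal sketch of the kind of field described above: a DRF `CharField` subclass that uses a regex to reject values containing HTML (assumed implementation, not the project's exact code):

```python
import re

from rest_framework import serializers


class HTMLFreeCharField(serializers.CharField):
    """CharField that rejects input containing HTML tags."""

    HTML_TAG_RE = re.compile(r"<[^>]+>")

    def to_internal_value(self, data):
        value = super().to_internal_value(data)
        if self.HTML_TAG_RE.search(value):
            raise serializers.ValidationError("HTML content is not allowed in this field.")
        return value
```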

- 1030-resolve-0-value-document-type-in-nasa_science
  - Description: Around 2000 of the docs coming out of the COSMOS API for nasa_science have a doc type value of 0.
  - Changes:
    - Added `obj.document_type != 0` as a condition in the `get_document_type` method within the `CuratedURLAPISerializer` (see the sketch after this entry)
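A hedged sketch of where that condition could sit, assuming `get_document_type` backs a `SerializerMethodField` and that `document_type` is an integer choice field whose display label is returned (both details are assumptions):

```python
def get_document_type(self, obj):
    # Treat 0 (an unset placeholder value) the same as "no document type".
    if obj.document_type is not None and obj.document_type != 0:
        return obj.get_document_type_display()
    return None
```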

- 1014-add-logs-when-importing-urls-so-we-know-how-many-were-expected-how-many-succeeded-and-how-many-failed
  - Description: When the URLs of a given collection are imported into COSMOS, a Slack notification is sent. This notification includes the name of the imported collection, the count of existing curated URLs, the total URL count reported by the server, the URLs successfully imported from the server, the delta URLs identified, and the delta URLs marked for deletion.
  - Changes:
    - The `get_full_texts()` function in `sde_collections/sinequa_api.py` is updated to yield `total_count` along with the rows
    - The `fetch_and_replace_full_text()` function in `sde_collections/tasks.py` captures the `total_server_count` and triggers `send_detailed_import_notification()` (see the sketch after this entry)
    - Added a function `send_detailed_import_notification()` in `sde_collections/utils/slack_utils.py` to structure the notification to be sent
    - Updated the associated tests affected by this change
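A rough sketch of how these pieces could fit together, based only on the function names above; the pagination fields, helper functions, model fields, and notification arguments are assumptions:

```python
def get_full_texts(collection_config_folder, source="scrapers"):
    """Page through the server's full-text records, yielding (total_count, rows)
    so the caller can compare the server's count against what was imported."""
    page = 1
    while True:
        response = _fetch_page(collection_config_folder, source, page)  # hypothetical helper
        total_count = response["total_count"]
        rows = response["rows"]
        if not rows:
            break
        yield total_count, rows
        page += 1


def fetch_and_replace_full_text(collection_id, source):
    collection = Collection.objects.get(id=collection_id)  # the sde_collections collection model
    total_server_count = 0
    imported_count = 0
    for total_server_count, rows in get_full_texts(collection.config_folder, source):
        imported_count += _replace_dump_urls(collection, rows)  # hypothetical helper
    send_detailed_import_notification(
        collection_name=collection.name,
        total_server_count=total_server_count,
        successfully_imported=imported_count,
    )
```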

- 3228-bugfix-preserve-scroll-position--document-type-selection-behavior-on-individual-urls
  - Description: Upon selecting a document type on any individual URL, the page refreshes and returns to the top. This is not necessarily a bug but an inconvenience, especially when working at the bottom of the page. Fix the JS code.
  - Changes:
    - Added a constant `scrollPosition` within `postDocumentTypePatterns` to store the y-coordinate position on the page
    - Modified the ajax reload to navigate back to this position upon posting/saving the document type changes

- 3227-bugfix-title-patterns-selecting-multi-url-pattern-does-nothing
  - Description: When selecting options from the match pattern type filter, the system does not filter the results as expected. Instead of displaying only the chosen variety of patterns, it continues to show all patterns.
  - Changes:
    - Corrected the column reference in the `title_patterns_table` definition
    - Made `match_pattern_type` searchable
    - Corrected the column references and made the code consistent across all the other tables, i.e., `exclude_patterns_table`, `include_patterns_table`, `division_patterns_table` and `document_type_patterns_table`

- 1190-add-tests-for-job-generation-pipeline
  - Description: Tests have been added to improve coverage of the config and job creation pipeline, alongside comprehensive tests for XML processing.
  - Changes:
    - Added `config_generation/tests/test_config_generation_pipeline.py`, which tests the config and job generation pipeline and ensures all components interact correctly
    - Updated `config_generation/tests/test_db_to_xml.py` to include comprehensive tests for XML processing

- 1001-tests-for-critical-functionalities
  - Description: Critical functionalities have been identified and listed, along with the critical areas that lack tests.
  - Changes:
    - Integrated coverage.py as an indicative tool in the workflow for automated coverage reports on PRs, displayed separately from the test results
    - Introduced `docs/architecture-decisions/testing_strategy.md`, which includes the coverage report, lists critical areas, and specifically identifies those critical areas that are untested or under-tested

- 1192-finalize-the-infrastructure-for-frontend-testing
  - Description: Set up comprehensive frontend testing infrastructure using Selenium WebDriver with Chrome, establishing a foundation for automated UI testing.
  - Changes:
    - Added the Selenium testing dependency to `requirements/local.txt`
    - Updated the Dockerfile to support Chrome and ChromeDriver
    - Created `BaseTestCase` and `AuthenticationMixin` as reusable test components (see the sketch after the next entry)
    - Implemented the core authentication test suite

- 1195-implement-unit-test-for-forms-on-the-frontend
  - Description: Implemented a comprehensive frontend test suite covering authentication, collection management, search functionality, and pattern application forms.
  - Changes:
    - Added tests for authentication flows
    - Implemented collection display and data table tests
    - Added universal search functionality tests
    - Created search pane filter tests
    - Added pattern application form tests with validation checks
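A rough sketch of what the reusable Selenium pieces mentioned above could look like; the Chrome options, login URL, and field names are assumptions for illustration:

```python
from django.contrib.staticfiles.testing import StaticLiveServerTestCase
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By


class BaseTestCase(StaticLiveServerTestCase):
    """Shared browser setup/teardown for frontend tests."""

    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        options = Options()
        options.add_argument("--headless=new")  # run Chrome without a display inside Docker
        options.add_argument("--no-sandbox")
        cls.driver = webdriver.Chrome(options=options)
        cls.driver.implicitly_wait(5)

    @classmethod
    def tearDownClass(cls):
        cls.driver.quit()
        super().tearDownClass()


class AuthenticationMixin:
    """Login helper reused across test cases."""

    def login(self, username, password):
        self.driver.get(f"{self.live_server_url}/accounts/login/")
        self.driver.find_element(By.NAME, "login").send_keys(username)
        self.driver.find_element(By.NAME, "password").send_keys(password)
        self.driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()
```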

- 1101-bug-fix-quotes-not-escaped-in-titles
  - Description: Title rules that include single quotes show up correctly in the Sinequa frontend (and the COSMOS API) but not in the Delta URLs page.
  - Changes:
    - Added an `escapeHtml` function in the `delta_url_list.js` file to handle special-character escaping correctly
    - Called this function while retrieving the titles in the `getGeneratedTitleColumn()` and `getCuratedGeneratedTitleColumn()` functions

- 1240-fix-code-scanning-alert-inclusion-of-functionality-from-an-untrusted-source
  - Description: Ensured all external resources load securely by switching to HTTPS and adding Subresource Integrity (SRI) checks.
  - Changes:
    - Replaced protocol-relative URLs with HTTPS
    - Added SRI (integrity) and crossorigin attributes to external script tags

- 1196-arrange-the-show-100-csv-customize-columns-boxes-to-be-in-one-line-on-the-delta-urls-page
  - Description: Formatted the 'Show 100', 'CSV' and 'Customize Columns' buttons to sit on a single line for optimal use of space.
  - Changes:
    - Updated the `delta_url_list.css` and `delta_url_list.js` files with the necessary modifications

- 1246-minor-enhancement-document-type-pattern-form-require-document-type-or-show-appropriate-error
  - Description: In the Document Type Pattern Form, if the user does not select a Document Type while filling out the form, an appropriate error message is displayed.
  - Changes:
    - Added a JavaScript validation check on form submission to ensure the document type (stored in a hidden input) is not empty
    - Displayed an error message and prevented form submission if the field is empty

- 1249-add-https-link-to-cors_allowed_origins-for-sde-lrm
  - Description: The feedback form API was throwing CORS errors; to rectify that, the appropriate HTTPS origin for sde-lrm needed to be added.
  - Changes:
    - Added `https://sde-lrm.nasa-impact.net` to `CORS_ALLOWED_ORIGINS` in the base settings

- 1252-document-type-filter-not-working-in-delta-urls-page
  - Description: Fixed the document type filtering functionality in the "Document Type Patterns" tab on the Delta URLs page.
  - Changes:
    - Added a new event listener to the Document Type Patterns dropdown to trigger filtering of the table results based on the selected value

- 1251-column-sorting-issue-curated-urls-count-sorts-by-delta-urls-count
  - Description: Fixed incorrect sorting behavior in the Collections table, where sorting by the Curated URLs column was not working as expected.
  - Changes:
    - Added a `data-order` attribute to the URL count columns for proper numeric sorting
    - Updated the SearchPane comparisons to use `@data-order` values instead of string-based loose equality checks, ensuring correct numeric filtering

- 1182-ml-classification-queue
  - Description: The inference API will provide confidence levels with the classification results it returns to COSMOS. We require a robust job processing mechanism that batches URLs based on the load the API can handle, tracks every individual job sent to the API, and ultimately evaluates the status of the jobs tied to each collection based on the results retrieved. It also needs to translate the classification labels returned by the API into the tags used internally within COSMOS.
  - Changes:
    - New environment values `INFERENCE_API_URL` and `TDAMM_CLASSIFICATION_THRESHOLD` have been created and set in the base settings
    - New models added:
      - ModelVersion: Tracking system for multiple versions of classification models with API identifiers
      - InferenceJob: Manages inference jobs for collections of URLs with a specific model version
      - ExternalJob: Represents a batched job sent to the inference API, with multiple ExternalJobs per InferenceJob
      - Status Tracking: Enum classes for job status tracking (queued, pending, completed, failed, cancelled, etc.)
    - BatchProcessor: Handles batching of URLs for efficient API processing (see the first sketch after this entry)
      - Text Length Management: Smart batching based on total text length with a configurable maximum (default 10,000 chars)
      - Oversized Text Handling: Automatic truncation of URLs that exceed the maximum batch size
      - Iterator Management: Safe handling of QuerySet iterators, including proper cleanup
    - InferenceAPIClient: Handles direct interaction with the Inference API
      - Model Management: Loading, unloading, and status checking for models
      - Job Submission: Support for batch submission with proper error handling
      - Retry Logic: Robust retry mechanisms for model loading operations
      - Health Checking: API health verification before operations
    - ClassificationThresholdProcessor: A class to handle the class-based thresholding of classification results
      - Separate classmethods for the TDAMM and division classifiers
      - Config file to handle the thresholds for each class
    - Celery Integration: Scheduled processing of the inference job queue at a configurable interval, executing `process_inference_job_queue`
      - Time-Based Execution: Configured to run during off-hours on weekdays (6pm-7am) and at all times on weekends
    - Concurrency and Safety (see the second sketch after this entry):
      - AdvisoryLock: Utility class for managing Postgres advisory locks
      - Transaction Management: Context managers for safe lock acquisition and release
      - ID Generation: Hash-based lock ID generation from string names
    - Updated TDAMMTags to remove redundant tags (MMA_M_EM, MMA_O_BI, MMA_O_BH, MMA_O_N) and add a missing one (MMA_S_FBOT). Also updated the enum value for the NOT_TDAMM tag
    - Classification results coming in from the inference API need to be translated to the TDAMMTags model; this is handled by `map_classification_to_tdamm_tags` in `classification_utils`, which will contain any relevant utilities for subsequent classifiers
    - The collections that will be run through the pipeline are limited to the following for now:
      - imagine_the_universe
      - physics_of_the_cosmos
      - stsci_space_telescope_science_institute
    - Once the front end has been updated to allow for tag edits, all astrophysics collections will be marked to be run through the pipeline
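To make the batching behavior above concrete, here is an illustrative sketch of length-bounded batching with truncation of oversized texts; the class name and the 10,000-character default come from the changelog, but the method names and data shapes are assumptions:

```python
class BatchProcessor:
    """Illustrative sketch: group (url_id, text) pairs into batches whose combined
    text length stays under max_batch_chars, truncating any single oversized text."""

    def __init__(self, max_batch_chars: int = 10_000):
        self.max_batch_chars = max_batch_chars

    def build_batches(self, items):
        batch, batch_chars = [], 0
        for url_id, text in items:
            if len(text) > self.max_batch_chars:
                text = text[: self.max_batch_chars]  # oversized text handling: truncate
            if batch and batch_chars + len(text) > self.max_batch_chars:
                yield batch
                batch, batch_chars = [], 0
            batch.append((url_id, text))
            batch_chars += len(text)
        if batch:
            yield batch
```

And a hedged sketch of the advisory-lock helper: a Postgres advisory lock keyed by a hash of a string name, acquired and released through a context manager (the `pg_try_advisory_lock`/`pg_advisory_unlock` functions are standard Postgres; the class shape is assumed):

```python
import hashlib
from contextlib import contextmanager

from django.db import connection


class AdvisoryLock:
    """Illustrative sketch of a Postgres advisory lock keyed by a string name."""

    def __init__(self, name: str):
        # Derive a stable 64-bit signed lock ID from the name.
        digest = hashlib.sha256(name.encode()).digest()
        self.lock_id = int.from_bytes(digest[:8], "big", signed=True)

    @contextmanager
    def acquire(self):
        with connection.cursor() as cursor:
            cursor.execute("SELECT pg_try_advisory_lock(%s)", [self.lock_id])
            acquired = cursor.fetchone()[0]
            try:
                yield acquired
            finally:
                if acquired:
                    cursor.execute("SELECT pg_advisory_unlock(%s)", [self.lock_id])
```

In the real app such a lock presumably guards the queue-processing task so overlapping scheduled runs cannot pick up the same jobs twice.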

- 1298-csv-export-command-for-urls
  - Description: Added a new Django management command to export URLs (DumpUrl, DeltaUrl, or CuratedUrl) to CSV files for analysis or backup purposes. The command allows filtering by collection and provides configurable export options.
  - Changes:
    - Created a new management command, `export_urls_to_csv.py`, to extract URL data to CSV format (see the sketch after this entry)
    - Implemented options to filter exports by model type and specific collections
    - Added support for excluding full text content with the `--full_text` flag to reduce file size
    - Included proper handling for paired fields (tdamm_tag_manual, tdamm_tag_ml)
    - Added automatic creation of a dedicated `csv_exports` directory for storing export files
    - Implemented batched processing to efficiently handle large datasets
    - Added progress reporting during export operations
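A hedged sketch of the management command's overall shape, assuming the standard Django `BaseCommand` interface; the argument semantics, model lookup, and field list are simplified placeholders rather than the project's exact code:

```python
# sde_collections/management/commands/export_urls_to_csv.py (illustrative sketch)
import csv
from pathlib import Path

from django.apps import apps
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = "Export DumpUrl, DeltaUrl, or CuratedUrl records to CSV."

    def add_arguments(self, parser):
        parser.add_argument("model", choices=["DumpUrl", "DeltaUrl", "CuratedUrl"])
        parser.add_argument("--collection", help="Limit the export to one collection")
        parser.add_argument("--full_text", action="store_true", help="Include full text content")

    def handle(self, *args, **options):
        model = apps.get_model("sde_collections", options["model"])
        queryset = model.objects.all()
        if options["collection"]:
            queryset = queryset.filter(collection__config_folder=options["collection"])

        export_dir = Path("csv_exports")
        export_dir.mkdir(exist_ok=True)  # dedicated directory for export files
        fields = ["url", "scraped_title"] + (["scraped_text"] if options["full_text"] else [])

        out_path = export_dir / f"{options['model'].lower()}_export.csv"
        with out_path.open("w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(fields)
            # iterator() streams rows in batches so large collections don't exhaust memory
            for count, obj in enumerate(queryset.iterator(chunk_size=2000), start=1):
                writer.writerow([getattr(obj, field, "") for field in fields])
                if count % 10000 == 0:
                    self.stdout.write(f"Exported {count} rows...")
        self.stdout.write(self.style.SUCCESS(f"Wrote {out_path}"))
```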

compose/local/django/start

Lines changed: 1 addition & 0 deletions

@@ -1,3 +1,4 @@
+#compose/local/django/start
 #!/bin/bash
 
 set -o errexit

compose/production/django/Dockerfile

Lines changed: 1 addition & 0 deletions

@@ -1,3 +1,4 @@
+# compose/production/django/Dockerfile
 # define an alias for the specfic python version used in this file.
 FROM python:3.10.14-slim-bullseye AS python
 

compose/production/django/start

Lines changed: 1 addition & 0 deletions

@@ -1,3 +1,4 @@
+# compose/production/django/start
 #!/bin/bash
 
 set -o errexit

compose/production/traefik/traefik.yml

Lines changed: 1 addition & 0 deletions

@@ -1,3 +1,4 @@
+# compose/production/traefik/traefik.yml
 log:
   level: INFO
 

config/celery.py

Lines changed: 15 additions & 0 deletions

@@ -0,0 +1,15 @@
+# config/celery.py
+import os
+
+from celery import Celery
+
+# Set the default Django settings module
+os.environ.setdefault("DJANGO_SETTINGS_MODULE", "config.settings.local")
+
+app = Celery("cosmos")
+
+# Configure Celery using Django settings
+app.config_from_object("django.conf:settings", namespace="CELERY")
+
+# Load task modules from all registered Django app configs
+app.autodiscover_tasks()
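For orientation: with `autodiscover_tasks()`, Celery imports the `tasks.py` module of every installed Django app, so the queue task named in the changelog only needs the standard decorator. A hypothetical sketch (the module location and body are assumptions):

```python
# inference/tasks.py (hypothetical location)
from celery import shared_task


@shared_task
def process_inference_job_queue():
    """Pick up queued inference jobs and submit batches to the inference API."""
    ...  # the real implementation lives in the inference app
```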

config/settings/base.py

Lines changed: 6 additions & 4 deletions

@@ -84,6 +84,7 @@
     "feedback",
     "sde_collections",
     "sde_indexing_helper.users",
+    "inference",
 ]
 
 # https://docs.djangoproject.com/en/dev/ref/settings/#installed-apps
@@ -92,6 +93,7 @@
 CORS_ALLOWED_ORIGINS = [
     "http://localhost:3000",
     "http://sde-lrm.nasa-impact.net",
+    "https://sde-lrm.nasa-impact.net",
     "https://sde-qa.nasa-impact.net",
     "https://sciencediscoveryengine.test.nasa.gov",
     "https://sciencediscoveryengine.nasa.gov",
@@ -288,11 +290,9 @@
 # https://docs.celeryq.dev/en/stable/userguide/configuration.html#std:setting-result_serializer
 CELERY_RESULT_SERIALIZER = "json"
 # https://docs.celeryq.dev/en/stable/userguide/configuration.html#task-time-limit
-# TODO: set to whatever value is adequate in your circumstances
-CELERY_TASK_TIME_LIMIT = 5 * 60
+CELERY_TASK_TIME_LIMIT = 30 * 60
 # https://docs.celeryq.dev/en/stable/userguide/configuration.html#task-soft-time-limit
-# TODO: set to whatever value is adequate in your circumstances
-CELERY_TASK_SOFT_TIME_LIMIT = 60
+CELERY_TASK_SOFT_TIME_LIMIT = 25 * 60
 # https://docs.celeryq.dev/en/stable/userguide/configuration.html#beat-scheduler
 CELERY_BEAT_SCHEDULER = "django_celery_beat.schedulers:DatabaseScheduler"
 # https://docs.celeryq.dev/en/stable/userguide/configuration.html#worker-send-task-events
@@ -349,3 +349,5 @@
 LRM_QA_PASSWORD = env("LRM_QA_PASSWORD")
 LRM_DEV_TOKEN = env("LRM_DEV_TOKEN")
 XLI_TOKEN = env("XLI_TOKEN")
+INFERENCE_API_URL = env("INFERENCE_API_URL", default="http://host.docker.internal:8000")
+TDAMM_CLASSIFICATION_THRESHOLD = env("TDAMM_CLASSIFICATION_THRESHOLD", default="0.5")

config/settings/local.py

Lines changed: 5 additions & 0 deletions

@@ -52,6 +52,11 @@
 
 # https://docs.celeryq.dev/en/stable/userguide/configuration.html#task-eager-propagates
 CELERY_TASK_EAGER_PROPAGATES = True
+
+# Inference API
+# ------------------------------------------------------------------------------
+INFERENCE_API_URL = env("INFERENCE_API_URL", default="http://host.docker.internal:8000")
+
 # Your stuff...
 # ------------------------------------------------------------------------------
 

config/settings/production.py

Lines changed: 3 additions & 0 deletions

@@ -166,6 +166,9 @@
     traces_sample_rate=env.float("SENTRY_TRACES_SAMPLE_RATE", default=0.0),
 )
 
+# Inference API
+# ------------------------------------------------------------------------------
+INFERENCE_API_URL = env("INFERENCE_API_URL", default="http://172.17.0.1:8000")
 
 # Your stuff...
 # ------------------------------------------------------------------------------
