Commit e298f56

Merge pull request #1254 from NASA-IMPACT/dev
merge bugfixes and ml integration to staging
2 parents 1dc106e + 460565d commit e298f56

66 files changed: +4651 / -904 lines


CHANGELOG.md

Lines changed: 72 additions & 0 deletions
@@ -12,6 +12,24 @@ For each PR made, an entry should be added to this changelog. It should contain
 - etc.

 ## Changelog
+
+- 1209-bug-fix-document-type-creator-form
+  - Description: The dropdown on the pattern creation form should default to the multi-URL option, since the doc type creator form is used mostly to create multi-URL patterns. The same default should apply to document type, division, and title patterns as well.
+  - Changes:
+    - Set the default value of `match_pattern_type` in the `BaseMatchPattern` class to `2`
+    - Changed the `test_create_simple_exclude_pattern` test within `TestDeltaExcludePatternBasics`
+    - Changed `test_create_division_pattern` and `test_create_document_type_pattern_single` within `TestFieldModifierPatternBasics`
+
+- 1052-update-cosmos-to-create-jobs-for-scrapers-and-indexers
+  - Description: The automation that generates scrapers and indexers when a collection's workflow status changes needed to be updated to reflect the curation workflow more accurately. Jobs are now generated during this process as well, to streamline it.
+  - Changes:
+    - Updated function nomenclature. Scrapers are Sinequa connector configurations used to scrape all the URLs prior to curation. Indexers are Sinequa connector configurations used to scrape the URLs after curation and to index content on production. Jobs trigger the connectors and are included as part of joblists.
+    - Parameterized the convert_template_to_job method to accept a job_source, which determines the value added to the `<Collection>` tag in the job XML.
+    - Updated the fields that are pertinent to transfer from a scraper to an indexer, and added a third level of XML processing to support this.
+    - scraper_template.xml and indexer_template.xml now contain the templates used for the respective configuration generation.
+    - Deleted the redundant webcrawler_initial_crawl.xml file.
+    - Added and updated tests on workflow status triggers.
+
 - 2889-serialize-the-tdamm-tags
   - Description: Have TDAMM serialized in a specific way and exposed via the Curated URLs API to be consumed into SDE Test/Prod
   - Changes:
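For reference, a minimal sketch of the default described in the 1209 entry of this hunk: only the field name `match_pattern_type` and the default of `2` come from the changelog; the IntegerChoices wrapper and the choice labels are assumptions.

from django.db import models


class BaseMatchPattern(models.Model):
    class MatchPatternTypeChoices(models.IntegerChoices):
        INDIVIDUAL_URL = 1, "Individual URL Pattern"  # assumed label
        MULTI_URL_PATTERN = 2, "Multi-URL Pattern"  # assumed label

    # A default of 2 preselects the multi-URL option on the creation forms.
    match_pattern_type = models.IntegerField(
        choices=MatchPatternTypeChoices.choices,
        default=MatchPatternTypeChoices.MULTI_URL_PATTERN,
    )

    class Meta:
        abstract = True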
@@ -36,13 +54,38 @@ For each PR made, an entry should be added to this changelog. It should contain
     - Used regex to catch any HTML content coming in as an input to form fields
     - Called this class within the serializer for necessary fields

+- 1030-resolve-0-value-document-type-in-nasa_science
+  - Description: Around 2000 of the docs coming out of the COSMOS API for nasa_science have a doc type value of 0.
+  - Changes:
+    - Added `obj.document_type != 0` as a condition in the `get_document_type` method within the `CuratedURLAPISerializer`
+
+- 1014-add-logs-when-importing-urls-so-we-know-how-many-were-expected-how-many-succeeded-and-how-many-failed
+  - Description: When the URLs of a given collection are imported into COSMOS, a Slack notification is sent. This notification includes the name of the imported collection, the count of existing curated URLs, the total URL count reported by the server, the URLs successfully imported from the server, the delta URLs identified, and the delta URLs marked for deletion.
+  - Changes:
+    - The get_full_texts() function in sde_collections/sinequa_api.py is updated to yield total_count along with rows.
+    - The fetch_and_replace_full_text() function in sde_collections/tasks.py captures the total_server_count and triggers send_detailed_import_notification().
+    - Added a function send_detailed_import_notification() in sde_collections/utils/slack_utils.py to structure the notification to be sent.
+    - Updated the associated tests affected by the inclusion of this functionality.
+
+- 3228-bugfix-preserve-scroll-position--document-type-selection-behavior-on-individual-urls
+  - Description: Upon selecting a document type on any individual URL, the page refreshes and returns to the top. This is not necessarily a bug but an inconvenience, especially when working at the bottom of the page. Fixed the JS code accordingly.
+  - Changes:
+    - Added a constant `scrollPosition` within `postDocumentTypePatterns` to store the y-coordinate position on the page
+    - Modified the AJAX reload to return to this position upon posting/saving the document type changes.
+
 - 3227-bugfix-title-patterns-selecting-multi-url-pattern-does-nothing
   - Description: When selecting options from the match pattern type filter, the system does not filter the results as expected. Instead of displaying only the chosen variety of patterns, it continues to show all patterns.
   - Changes:
     - In the `title_patterns_table` definition, corrected the column reference
     - Made `match_pattern_type` searchable
     - Corrected the column references and made the code consistent across all the other tables, i.e., `exclude_patterns_table`, `include_patterns_table`, `division_patterns_table` and `document_type_patterns_table`

+- 1190-add-tests-for-job-generation-pipeline
+  - Description: Tests have been added to enhance coverage for the config and job creation pipeline, alongside comprehensive tests for XML processing.
+  - Changes:
+    - Added config_generation/tests/test_config_generation_pipeline.py, which tests the config and job generation pipeline, ensuring all components interact correctly
+    - config_generation/tests/test_db_to_xml.py is updated to include comprehensive tests for XML processing
+
 - 1001-tests-for-critical-functionalities
   - Description: Critical functionalities have been identified and listed, along with the critical areas lacking tests
   - Changes:
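The 1030 entry in this hunk boils down to treating a document_type of 0 as unset when serializing. A hedged sketch of that condition follows; the real `CuratedURLAPISerializer` is a ModelSerializer with more fields, and the display helper used here is an assumption.

from rest_framework import serializers


class CuratedURLAPISerializer(serializers.Serializer):  # simplified stand-in for the project's ModelSerializer
    document_type = serializers.SerializerMethodField()

    def get_document_type(self, obj):
        # 0 comes through the pipeline as a placeholder, so treat it the same as "no document type".
        if obj.document_type is not None and obj.document_type != 0:
            return obj.get_document_type_display()  # Django's auto-generated helper for choice fields (assumed here)
        return None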
@@ -65,3 +108,32 @@ For each PR made, an entry should be added to this changelog. It should contain
     - Added universal search functionality tests
     - Created search pane filter tests
     - Added pattern application form tests with validation checks
+
+- 1101-bug-fix-quotes-not-escaped-in-titles
+  - Description: Title rules that include single quotes show up correctly in the Sinequa frontend (and the COSMOS API) but not on the Delta URLs page.
+  - Changes:
+    - Added an `escapeHtml` function in the `delta_url_list.js` file to handle special character escaping correctly.
+    - Called this function while retrieving the titles in the `getGeneratedTitleColumn()` and `getCuratedGeneratedTitleColumn()` functions.
+
+- 1240-fix-code-scanning-alert-inclusion-of-functionality-from-an-untrusted-source
+  - Description: Ensured all external resources load securely by switching to HTTPS and adding Subresource Integrity (SRI) checks.
+  - Changes:
+    - Replaced protocol-relative URLs with HTTPS.
+    - Added SRI (integrity) and crossorigin attributes to external script tags.
+
+- 1196-arrange-the-show-100-csv-customize-columns-boxes-to-be-in-one-line-on-the-delta-urls-page
+changelog-update-Issue-1001
+  - Description: Formatted the buttons 'Show 100', 'CSV', and 'Customize Columns' to be on a single line for optimal use of space.
+  - Changes:
+    - Updated the delta_url_list.css and delta_url_list.js files with the necessary modifications
+
+- 1246-minor-enhancement-document-type-pattern-form-require-document-type-or-show-appropriate-error
+  - Description: In the Document Type Pattern Form, if the user does not select a Document Type while filling out the form, an appropriate error message is displayed.
+  - Changes:
+    - Added a JavaScript validation check on form submission to ensure the document type (stored in a hidden input) is not empty.
+    - Display an error message and prevent form submission if the field is empty.
+
+- 1249-add-https-link-to-cors_allowed_origins-for-sde-lrm
+  - Description: The feedback form API was throwing CORS errors; to rectify that, the appropriate https link for sde-lrm needed to be added.
+  - Changes:
+    - Added `https://sde-lrm.nasa-impact.net` to `CORS_ALLOWED_ORIGINS` in the base settings.

compose/local/django/start

Lines changed: 1 addition & 0 deletions
@@ -1,3 +1,4 @@
+#compose/local/django/start
 #!/bin/bash

 set -o errexit

compose/production/django/Dockerfile

Lines changed: 1 addition & 0 deletions
@@ -1,3 +1,4 @@
+# compose/production/django/Dockerfile
 # define an alias for the specific python version used in this file.
 FROM python:3.10.14-slim-bullseye AS python

compose/production/django/start

Lines changed: 1 addition & 0 deletions
@@ -1,3 +1,4 @@
+# compose/production/django/start
 #!/bin/bash

 set -o errexit

compose/production/traefik/traefik.yml

Lines changed: 1 addition & 0 deletions
@@ -1,3 +1,4 @@
+# compose/production/traefik/traefik.yml
 log:
   level: INFO

config/celery.py

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+# config/celery.py
+import os
+
+from celery import Celery
+from celery.schedules import crontab
+
+# Set the default Django settings module
+os.environ.setdefault("DJANGO_SETTINGS_MODULE", "config.settings.local")
+
+app = Celery("cosmos")
+
+# Configure Celery using Django settings
+app.config_from_object("django.conf:settings", namespace="CELERY")
+
+# Load task modules from all registered Django app configs
+app.autodiscover_tasks()
+
+app.conf.beat_schedule = {
+    "process-inference-queue": {
+        "task": "inference.tasks.process_inference_job_queue",
+        # Only run between 6pm and 7am
+        "schedule": crontab(minute="*/5", hour="18-23,0-6"),
+    },
+}
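A beat entry like the one above only works if a Celery task is registered under the dotted path it names. A hypothetical sketch of that task's shape follows; the real implementation lives in the `inference` app and is not part of this diff.

# inference/tasks.py (hypothetical sketch)
from celery import shared_task


@shared_task
def process_inference_job_queue():
    """Drain queued inference jobs; beat schedules this every 5 minutes between 18:00 and 07:00."""
    ...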

config/settings/base.py

Lines changed: 6 additions & 4 deletions
@@ -84,6 +84,7 @@
     "feedback",
     "sde_collections",
     "sde_indexing_helper.users",
+    "inference",
 ]

 # https://docs.djangoproject.com/en/dev/ref/settings/#installed-apps
@@ -92,6 +93,7 @@
 CORS_ALLOWED_ORIGINS = [
     "http://localhost:3000",
     "http://sde-lrm.nasa-impact.net",
+    "https://sde-lrm.nasa-impact.net",
     "https://sde-qa.nasa-impact.net",
     "https://sciencediscoveryengine.test.nasa.gov",
     "https://sciencediscoveryengine.nasa.gov",
@@ -288,11 +290,9 @@
 # https://docs.celeryq.dev/en/stable/userguide/configuration.html#std:setting-result_serializer
 CELERY_RESULT_SERIALIZER = "json"
 # https://docs.celeryq.dev/en/stable/userguide/configuration.html#task-time-limit
-# TODO: set to whatever value is adequate in your circumstances
-CELERY_TASK_TIME_LIMIT = 5 * 60
+CELERY_TASK_TIME_LIMIT = 30 * 60
 # https://docs.celeryq.dev/en/stable/userguide/configuration.html#task-soft-time-limit
-# TODO: set to whatever value is adequate in your circumstances
-CELERY_TASK_SOFT_TIME_LIMIT = 60
+CELERY_TASK_SOFT_TIME_LIMIT = 25 * 60
 # https://docs.celeryq.dev/en/stable/userguide/configuration.html#beat-scheduler
 CELERY_BEAT_SCHEDULER = "django_celery_beat.schedulers:DatabaseScheduler"
 # https://docs.celeryq.dev/en/stable/userguide/configuration.html#worker-send-task-events
@@ -349,3 +349,5 @@
 LRM_QA_PASSWORD = env("LRM_QA_PASSWORD")
 LRM_DEV_TOKEN = env("LRM_DEV_TOKEN")
 XLI_TOKEN = env("XLI_TOKEN")
+INFERENCE_API_URL = env("INFERENCE_API_URL", default="http://host.docker.internal:8000")
+TDAMM_CLASSIFICATION_THRESHOLD = env("TDAMM_CLASSIFICATION_THRESHOLD", default="0.5")
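The two settings appended at the bottom are read from the environment as strings. A hedged sketch of how downstream code in the inference app might consume them (the consumer itself is not part of this diff):

from django.conf import settings

# Falls back to the docker-host alias when INFERENCE_API_URL is not set in the environment.
api_url = settings.INFERENCE_API_URL

# env() returns a string here (default "0.5"), so cast before comparing model scores.
threshold = float(settings.TDAMM_CLASSIFICATION_THRESHOLD)


def passes_threshold(score: float) -> bool:
    return score >= threshold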

config_generation/db_to_xml.py

Lines changed: 43 additions & 30 deletions
@@ -148,35 +148,51 @@ def convert_template_to_scraper(self, collection) -> None:
         scraper_config = self.update_config_xml()
         return scraper_config

-    def convert_template_to_plugin_indexer(self, scraper_editor) -> None:
+    def convert_template_to_job(self, collection, job_source) -> None:
         """
-        assuming this class has been instantiated with the scraper_template.xml
+        assuming this class has been instantiated with the job_template.xml
+        """
+        self.update_or_add_element_value("Collection", f"/{job_source}/{collection.config_folder}/")
+        job_config = self.update_config_xml()
+        return job_config
+
+    def convert_template_to_indexer(self, scraper_editor) -> None:
+        """
+        assuming this class has been instantiated with the final_config_template.xml
         """

         transfer_fields = [
-            "KeepHashFragmentInUrl",
-            "CorrectDomainCookies",
-            "IgnoreSessionCookies",
-            "DownloadImages",
-            "DownloadMedia",
-            "DownloadCss",
-            "DownloadFtp",
-            "DownloadFile",
-            "IndexJs",
-            "FollowJs",
-            "CrawlFlash",
-            "NormalizeSecureSchemesWhenTestingVisited",
-            "RetryCount",
-            "RetryPause",
-            "AddBaseHref",
-            "AddMetaContentType",
-            "NormalizeUrls",
+            "Throttle",
         ]

         double_transfer_fields = [
-            ("UrlAccess", "AllowXPathCookies"),
             ("UrlAccess", "UseBrowserForWebRequests"),
-            ("UrlAccess", "UseHttpClientForWebRequests"),
+            ("UrlAccess", "BrowserForWebRequestsReadinessThreshold"),
+            ("UrlAccess", "BrowserForWebRequestsInitialDelay"),
+            ("UrlAccess", "BrowserForWebRequestsMaxTotalDelay"),
+            ("UrlAccess", "BrowserForWebRequestsMaxResourcesDelay"),
+            ("UrlAccess", "BrowserForWebRequestsLogLevel"),
+            ("UrlAccess", "BrowserForWebRequestsViewportWidth"),
+            ("UrlAccess", "BrowserForWebRequestsViewportHeight"),
+            ("UrlAccess", "BrowserForWebRequestsAdditionalJavascript"),
+            ("UrlAccess", "PostLoginUrl"),
+            ("UrlAccess", "PostLoginData"),
+            ("UrlAccess", "GetBeforePostLogin"),
+            ("UrlAccess", "PostLoginAutoRedirect"),
+            ("UrlAccess", "ReLoginCount"),
+            ("UrlAccess", "ReLoginDelay"),
+            ("UrlAccess", "DetectHtmlLoginPattern"),
+            ("IndexerClient", "RetryTimeout"),
+            ("IndexerClient", "RetrySleep"),
+        ]
+
+        triple_transfer_fields = [
+            ("UrlAccess", "BrowserLogin", "Activate"),
+            ("UrlAccess", "BrowserLogin", "RemoteDebuggingPort"),
+            ("UrlAccess", "BrowserLogin", "BrowserLogLevel"),
+            ("UrlAccess", "BrowserLogin", "ShowDevTools"),
+            ("UrlAccess", "BrowserLogin", "SuccessCondition"),
+            ("UrlAccess", "BrowserLogin", "CookieFilter"),
         ]

         for field in transfer_fields:
@@ -187,18 +203,15 @@ def convert_template_to_plugin_indexer(self, scraper_editor) -> None:
                 f"{parent}/{child}", scraper_editor.get_tag_value(f"{parent}/{child}", strict=True)
             )

+        for grandparent, parent, child in triple_transfer_fields:
+            self.update_or_add_element_value(
+                f"{grandparent}/{parent}/{child}",
+                scraper_editor.get_tag_value(f"{grandparent}/{parent}/{child}", strict=True),
+            )
+
         scraper_config = self.update_config_xml()
         return scraper_config

-    def convert_template_to_indexer(self, collection) -> None:
-        """
-        assuming this class has been instantiated with the indexer_template.xml
-        """
-        self.update_or_add_element_value("Collection", f"/SDE/{collection.config_folder}/")
-        indexer_config = self.update_config_xml()
-
-        return indexer_config
-
     def _mapping_exists(self, new_mapping: ET.Element):
         """
         Check if the mapping with given parameters already exists in the XML tree
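A hedged usage sketch of the renamed methods above; the import path, constructor argument, and job_source value are assumptions based on this hunk and on the tests later in this commit (which patch `sde_collections.models.collection.XmlEditor`).

from config_generation.db_to_xml import XmlEditor  # assumed import path


def generate_scraper_job_xml(collection, job_source="scrapers"):
    # Assumed constructor usage: an editor instantiated over the job template.
    editor = XmlEditor("config_generation/xmls/job_template.xml")
    # job_source lands in the job XML as <Collection>/{job_source}/{collection.config_folder}/</Collection>.
    return editor.convert_template_to_job(collection, job_source)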
Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
+from unittest.mock import MagicMock, call, patch
+
+from django.test import TestCase
+
+from sde_collections.models.collection import Collection
+from sde_collections.models.collection_choice_fields import WorkflowStatusChoices
+
+"""
+Workflow status change → Opens template → Applies XML transformation → Writes to GitHub.
+
+- When the `workflow_status` changes, it triggers the relevant config creation method.
+- The method reads a template and processes it using `XmlEditor`.
+- `XmlEditor` modifies the template by injecting collection-specific values and transformations.
+- The generated XML is passed to `_write_to_github()`, which commits it directly to GitHub.
+
+Note: This test verifies that the correct methods are triggered and XML content is passed to GitHub.
+The actual XML structure and correctness are tested separately in `test_db_xml.py`.
+"""
+
+
+class TestConfigCreation(TestCase):
+    def setUp(self):
+        self.collection = Collection.objects.create(
+            name="Test Collection", division="1", workflow_status=WorkflowStatusChoices.RESEARCH_IN_PROGRESS
+        )
+
+    @patch("sde_collections.utils.github_helper.GitHubHandler")  # Mock GitHubHandler
+    @patch("sde_collections.models.collection.Collection._write_to_github")
+    @patch("sde_collections.models.collection.XmlEditor")
+    def test_ready_for_engineering_triggers_config_and_job_creation(
+        self, MockXmlEditor, mock_write_to_github, MockGitHubHandler
+    ):
+        """
+        When the collection's workflow status is updated to READY_FOR_ENGINEERING,
+        it should trigger the creation of scraper configuration and job files.
+        """
+        # Mock GitHubHandler to avoid actual API calls
+        mock_github_instance = MockGitHubHandler.return_value
+        mock_github_instance.create_file.return_value = None
+        mock_github_instance.create_or_update_file.return_value = None
+
+        # Set up the XmlEditor mock for both config and job
+        mock_editor_instance = MockXmlEditor.return_value
+        mock_editor_instance.convert_template_to_scraper.return_value = "<scraper_config>config_data</scraper_config>"
+        mock_editor_instance.convert_template_to_job.return_value = "<scraper_job>job_data</scraper_job>"
+
+        # Simulate the status change to READY_FOR_ENGINEERING
+        self.collection.workflow_status = WorkflowStatusChoices.READY_FOR_ENGINEERING
+        self.collection.save()
+
+        # Verify that the XML for both config and job are generated and written to GitHub
+        expected_calls = [
+            call(self.collection._scraper_config_path, "<scraper_config>config_data</scraper_config>", False),
+            call(self.collection._scraper_job_path, "<scraper_job>job_data</scraper_job>", False),
+        ]
+        mock_write_to_github.assert_has_calls(expected_calls, any_order=True)
+
+    @patch("sde_collections.models.collection.GitHubHandler")  # Mock GitHubHandler in the correct module path
+    @patch("sde_collections.models.collection.Collection._write_to_github")
+    @patch("sde_collections.models.collection.XmlEditor")
+    def test_ready_for_curation_triggers_indexer_config_and_job_creation(
+        self, MockXmlEditor, mock_write_to_github, MockGitHubHandler
+    ):
+        """
+        When the collection's workflow status is updated to READY_FOR_CURATION,
+        it should trigger indexer config and job creation methods.
+        """
+        # Mock GitHubHandler to avoid actual API calls
+        mock_github_instance = MockGitHubHandler.return_value
+        mock_github_instance.check_file_exists.return_value = True  # Assume scraper exists
+        mock_github_instance._get_file_contents.return_value = MagicMock()
+        mock_github_instance._get_file_contents.return_value.decoded_content = (
+            b"<scraper_config>Mock Data</scraper_config>"
+        )
+
+        # Set up the XmlEditor mock for both config and job
+        mock_editor_instance = MockXmlEditor.return_value
+        mock_editor_instance.convert_template_to_indexer.return_value = "<indexer_config>config_data</indexer_config>"
+        mock_editor_instance.convert_template_to_job.return_value = "<indexer_job>job_data</indexer_job>"
+
+        # Simulate the status change to READY_FOR_CURATION
+        self.collection.workflow_status = WorkflowStatusChoices.READY_FOR_CURATION
+        self.collection.save()
+
+        # Verify that the XML for both indexer config and job are generated and written to GitHub
+        expected_calls = [
+            call(self.collection._indexer_config_path, "<indexer_config>config_data</indexer_config>", True),
+            call(self.collection._indexer_job_path, "<indexer_job>job_data</indexer_job>", False),
+        ]
+        mock_write_to_github.assert_has_calls(expected_calls, any_order=True)
