Releases: NASA-IMPACT/COSMOS
v3.1.0
COSMOS v3.1.0 Release Notes
Overview
COSMOS v3.1.0 introduces a major new machine learning classification pipeline and significant improvements to system reliability and user experience. The centerpiece of this release is the ML Classification Queue system, which enables automated document classification via the separate inference API. This is currently used to populate the Time Domain and Multi-Messenger Astronomy (TDAMM) portal by automatically tagging astrophysics content for specialized discovery. In the future, we will leverage it for division and document type tagging, as well as other metadata automation tasks.
This release also includes comprehensive testing infrastructure improvements with new frontend and backend test suites, ensuring code quality and reliability. The user interface has been enhanced with multiple usability fixes for common workflows, and several critical bugs have been resolved to improve system stability. Administrative capabilities have been expanded with better logging, form validation, and API enhancements.
Major Features
ML Classification Pipeline
- New Classification System: Implemented a robust job processing mechanism to batch URLs for the inference API
- Smart Batching: Added intelligent text length management with configurable maximums
- Comprehensive Job Tracking: New models to track individual jobs sent to the API:
  - ModelVersion: Tracks multiple versions of classification models
  - InferenceJob: Manages jobs for collections of URLs
  - ExternalJob: Represents batched jobs sent to the inference API
- Status Management: Complete workflow with status tracking (queued, pending, completed, failed, cancelled)
- Classification Threshold Processing: Implemented class-based thresholding for classification results
- TDAMM Tag Updates: Removed redundant tags and added missing ones for more accurate classification
- Celery Integration: Scheduled processing during off-hours on weekdays and continuously on weekends
- Admin: New admin panels for viewing Model Versions, Inference Queue, and External Jobs
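As an illustration of the Smart Batching item above, here is a minimal sketch of grouping URLs so that the combined text length of each batch stays under a configurable maximum. This is not the COSMOS implementation: `MAX_BATCH_TEXT_LENGTH`, `batch_by_text_length`, and the `(url, text)` input shape are hypothetical names chosen for the example, and sending an oversized document alone in its own batch is an assumption.

```python
from typing import Iterable, Iterator

# Hypothetical limit; the release notes only say the maximum is configurable.
MAX_BATCH_TEXT_LENGTH = 10_000


def batch_by_text_length(
    documents: Iterable[tuple[str, str]],
    max_length: int = MAX_BATCH_TEXT_LENGTH,
) -> Iterator[list[str]]:
    """Yield lists of URLs whose combined text length stays within max_length.

    A single document longer than max_length is emitted in a batch by itself
    rather than dropped (an assumption; truncating it would be another option).
    """
    batch: list[str] = []
    batch_length = 0
    for url, text in documents:
        text_length = len(text)
        # Close the current batch if adding this document would exceed the cap.
        if batch and batch_length + text_length > max_length:
            yield batch
            batch, batch_length = [], 0
        batch.append(url)
        batch_length += text_length
    if batch:
        yield batch


docs = [
    ("https://example.com/a", "x" * 6),
    ("https://example.com/b", "x" * 4),
    ("https://example.com/c", "x" * 20),  # longer than the cap on its own
    ("https://example.com/d", "x" * 2),
]
batches = list(batch_by_text_length(docs, max_length=10))
```

With a cap of 10 characters, the first two documents share a batch, the oversized third document travels alone, and the fourth starts a new batch.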
API Enhancements
- Feedback Form Dropdown: New API endpoint and dropdown options for the feedback form
- TDAMM Tag Serialization: Modified serialization method in the CuratedURLAPISerializer to better support frontend
- CORS Support: Added HTTPS link for SDE-LRM to CORS allowed origins
Testing Improvements
Frontend Testing Infrastructure
- Selenium WebDriver: Comprehensive frontend testing setup with Chrome
- Authentication Testing: Implemented test suite for authentication flows
- UI Component Tests: Added tests for collection display, data tables, and search functionality
- Form Validation: Created tests for pattern application forms with validation checks
Backend Testing Coverage
- Job Generation Pipeline: Enhanced coverage for config and job creation pipeline
- XML Processing: Comprehensive tests for XML processing
- Critical Functionality Planning: Identified critical areas of the codebase for future testing
- Coverage Reporting: Integrated coverage.py for automated coverage reports on PRs
Infrastructure Updates
System Automation
- Scraper and Indexer Management: Updated nomenclature and parameterized the convert_template_to_job method
- Job Generation: Streamlined job creation during the curation workflow
- XML Processing: Enhanced XML processing to facilitate configuration generation
Administrative Improvements
- Slack Notifications: Enhanced import notifications with detailed status updates
- Feedback System: Updated slack notification structure with dropdown option text
- Testing Strategy: Introduced comprehensive testing strategy documentation
- Changelog: Introduced CHANGELOG.md to provide consumable descriptions of PRs
Bug Fixes
Data Processing
- Zero-Value Document Type: Fixed approximately 2,000 documents with document type value of 0 in nasa_science
- URL Import Logging: Enhanced logging to show expected, succeeded, and failed URL imports
- Document Type Creator Form: Set multi-select as default option for pattern creation forms
User Interface
- Quote Escaping: Fixed issue with quotes not being properly escaped in titles
- Scroll Position: Preserved scroll position when selecting document types on individual URLs
- Pattern Filtering: Fixed filtering issues in the Title Patterns multi-URL pattern selection
- Column Sorting: Corrected sorting behavior in Collections table for URL count columns
- Document Type Filter: Fixed filtering functionality in the Delta URLs page
- Button Layout: Improved spacing by arranging 'Show 100', 'CSV', and 'Customize Columns' buttons in one line
- Form Validation: Added appropriate error messages for empty document type selections
Security
- HTML Content Validation: Added validation to protect against HTML injection in the feedback form
- Secure Resource Loading: Ensured all external resources load securely by switching to HTTPS and adding SRI checks
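For readers unfamiliar with SRI (Subresource Integrity), the sketch below shows how an `integrity` value is derived: it is the hash algorithm name, a hyphen, and the base64-encoded digest of the exact file contents. The helper names `sri_hash` and `script_tag` are hypothetical and are not taken from the COSMOS codebase.

```python
import base64
import hashlib


def sri_hash(content: bytes, algorithm: str = "sha384") -> str:
    """Return an SRI value of the form '<algorithm>-<base64 digest>'."""
    if algorithm not in {"sha256", "sha384", "sha512"}:
        raise ValueError(f"unsupported SRI algorithm: {algorithm}")
    digest = hashlib.new(algorithm, content).digest()
    return f"{algorithm}-{base64.b64encode(digest).decode('ascii')}"


def script_tag(url: str, content: bytes) -> str:
    """Build a <script> tag that pins the resource at url to the given content."""
    return (
        f'<script src="{url}" integrity="{sri_hash(content)}" '
        'crossorigin="anonymous"></script>'
    )


# Example with made-up file contents and a placeholder HTTPS URL.
tag = script_tag("https://cdn.example.com/lib.js", b"console.log('hi');")
```

A browser that supports SRI refuses to run the script if the downloaded bytes do not hash to the pinned value.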
What's Changed
- Specify pattern match type in the pattern forms by @dhanur-sharma in #1172
- Add Curated URLs column to homepage by @dhanur-sharma in #1170
- Added affected curated urls count on url pattern pages by @dhanur-sharma in #1169
- Uniform Handling of Errors throughout COSMOS by @saifrk in #1136
- Update exclude checkmark action to change behavior based on inclusion status by @Kirandawadi in #1167
- Enforce Code Quality at PR Time by @saifrk in #1201
- Automatic Running of Tests On Pull Request by @saifrk in #1123
- 1177 notifications update slack notification pipeline by @bishwaspraveen in #1200
- API Tests for Token Verification, Request Accuracy, Response Parsing, and Error Handling by @saifrk in #1089
- Create CHANGELOG.md by @CarsonDavis in #1221
- Tests for critical functionalities by @saifrk in #1220
- HTML validator has been set at serializer level by @bishwaspraveen in #1218
- Implement unit test for forms on the frontend by @Kirandawadi in #1226
- Finalize the infrastructure for frontend testing by @Kirandawadi in #1222
- 960 notifications add a dropdown with options on the feedback form by @bishwaspraveen in #1210
- serialzed and changed API structure to fit LRM requirements by @bishwaspraveen in #1215
- 3227 bugfix title patterns selecting multi url pattern does nothing by @bishwaspraveen in #1230
- Added Changelog for issue #1192 and #1195 by @Kirandawadi in #1237
- Update run_full_test_suite.yml by @CarsonDavis in #1233
- Added changelog for Issue_1001 by @saifrk in #1234
- Frontend test "test_create_title_pattern" failing due to insufficient wait time by @Kirandawadi in #1236
- remove unused getParameterByName in delta_url_list.js by @CarsonDavis in #1239
- Merge Dev Into Staging by @CarsonDavis in #1213
- Tests for Config & Job Creation + XML Processing by @saifrk in #1225
- Updated template and job creation for scrapers and indexers by @dhanur-sharma in #1072
- changes js code to preserve y scroll position while saving by @bishwaspraveen in #1228
- Slack Notification when importing urls by @saifrk in #1229
- Fix the issues of doctypes having 0 as a doctype by @bishwaspraveen in #1031
- 1101 bug fix quotes not escaped in titles by @bishwaspraveen in #1244
- changed the default to multi url pattern by @bishwaspraveen in #1216
- Minor Enhancement: Document Type Pattern Form – Require Document Type or Show Appropriate Error by @saifrk in #1247
- Added https URL to allow CORS by @dhanur-sharma in #1250
- Alignment of ‘Show 100’, ‘CSV’, ‘CUSTOMIZE COLUMNS’ by @saifrk in #1242
- Implement HTTPS and add SRI to external resources to fix CodeQL alert by @saifrk in #1245
- Integrate classification queue by @CarsonDavis in #1248
- 1182 ml classification queue by @CarsonDavis in #1219
- merge bugfixes and ml integration to staging by @CarsonDavis in #1254
- Staging by @CarsonDavis in #1255
- add sde-lrm tdamm to allowed cors by @CarsonDavis in #1256
- Updated generate_inference_job to create ModelVersion if needed by @dhanur-sharma in #1260
- Updated settings by @dhanur-sharma in https://githu...
v3.0.0
COSMOS v3.0.0 Release Notes
Overview
COSMOS v3.0.0 introduces several major architectural changes that fundamentally enhance the system's capabilities. The primary feature is a new website reindexing system that allows COSMOS to stay up-to-date with source website changes, addressing a key limitation of previous versions where websites could only be scraped once. This release includes comprehensive updates to the data models, frontend interface, rule creation system, and backend processing along with some bugfixes from v2.0.1.
The Environmental Justice (EJ) system has been significantly expanded, growing from fewer than 100 manually curated datasets to approximately 1,000 datasets through the integration of machine learning classification of NASA CMR records. This expansion is supported by a new modular processing suite that generates and extracts metadata using Subject Matter Expert (SME) criteria.
To support future machine learning integration, COSMOS now implements a sophisticated two-column system that allows fields to maintain both ML-generated classifications and manual curator overrides. This system has been seamlessly integrated into the data models, serializers, and APIs, ensuring that both automated and human-curated data can coexist while maintaining clear precedence rules.
To ensure reliability and maintainability of these major changes, this release includes extensive testing coverage with 213 new tests spanning URL processing, pattern management, Environmental Justice functionality, workflow triggers, and data migrations. Additionally, we've added comprehensive documentation across 15 new README files that cover everything from fundamental pattern system concepts to detailed API specifications and ML integration guidelines.
Major Features
Reindexing System
- New Data Models: Introduced DumpUrl, DeltaUrl, and CuratedUrl to support the reindexing workflow
- Automated Workflows:
  - New process to calculate deltas, deletions, and additions during migration
  - Automatic promotion of DeltaUrls to CuratedUrls
  - Status-based triggers for data ingestion and processing
- Duplicate Prevention: System now prevents duplicate patterns and URLs
- Enhanced Frontend:
  - Added reindexing status column to collection and URL list pages
  - New deletion tracking column on URL list page
  - Updated collection list to display delta URL counts
  - Improved URL list page accessibility via delta URL count
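The delta calculation and promotion steps listed above can be pictured with a small sketch. It uses plain dictionaries mapping URL to scraped title instead of the DumpUrl, DeltaUrl, and CuratedUrl models, and comparing only titles is an assumption; `Deltas`, `calculate_deltas`, and `promote` are hypothetical names.

```python
from typing import NamedTuple


class Deltas(NamedTuple):
    additions: dict[str, str]      # URLs that are new in the dump
    modifications: dict[str, str]  # URLs whose scraped title changed
    deletions: list[str]           # curated URLs missing from the dump


def calculate_deltas(dump: dict[str, str], curated: dict[str, str]) -> Deltas:
    """Compare a fresh scrape (dump) against the current curated set."""
    additions = {url: title for url, title in dump.items() if url not in curated}
    modifications = {
        url: title
        for url, title in dump.items()
        if url in curated and curated[url] != title
    }
    deletions = sorted(url for url in curated if url not in dump)
    return Deltas(additions, modifications, deletions)


def promote(curated: dict[str, str], deltas: Deltas) -> dict[str, str]:
    """Apply the deltas to produce the next curated set."""
    promoted = {
        url: title for url, title in curated.items() if url not in deltas.deletions
    }
    promoted.update(deltas.additions)
    promoted.update(deltas.modifications)
    return promoted


curated = {"https://example.com/a": "A", "https://example.com/b": "B", "https://example.com/c": "C"}
dump = {"https://example.com/a": "A", "https://example.com/b": "B (updated)", "https://example.com/d": "D"}
deltas = calculate_deltas(dump, curated)
next_curated = promote(curated, deltas)
```

After promotion the curated set matches the latest dump, which is the property a reindexing cycle is meant to restore.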
Pattern System Improvements
- Complete modularization of the pattern system
- Enhanced handling of edge cases including overlapping patterns
- Improved unapply logic
- Functional inclusion rules
- Pattern precedence system: most specific pattern takes priority, with pattern length as tiebreaker
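The precedence rule in the last item can be sketched as follows. The notes do not define "most specific", so this example assumes it means the pattern that matches the fewest URLs in the collection, with the longer pattern string winning ties. `matched_urls` and `winning_pattern` are hypothetical helpers, and `fnmatchcase` stands in for whatever wildcard matching COSMOS actually uses.

```python
from fnmatch import fnmatchcase
from typing import Optional


def matched_urls(pattern: str, urls: list[str]) -> list[str]:
    """Return the URLs matched by a '*' wildcard pattern.

    Note: fnmatch also treats '?' and '[' as special, so this sketch is only
    suitable for patterns that use '*' alone.
    """
    return [url for url in urls if fnmatchcase(url, pattern)]


def winning_pattern(url: str, patterns: list[str], all_urls: list[str]) -> Optional[str]:
    """Pick the pattern that should apply to url, or None if nothing matches."""
    candidates = [p for p in patterns if fnmatchcase(url, p)]
    if not candidates:
        return None
    # Fewest matched URLs first (assumed meaning of "most specific");
    # the longer pattern string breaks ties.
    return min(candidates, key=lambda p: (len(matched_urls(p, all_urls)), -len(p)))


urls = [
    "https://example.com/docs/a",
    "https://example.com/docs/b",
    "https://example.com/blog/c",
]
patterns = [
    "https://example.com/*",
    "https://example.com/docs/*",
    "*example.com/docs/*",
    "https://example.com/docs/a",
]
```

Here `https://example.com/docs/b` is matched by three patterns; the two `docs` patterns tie at two matched URLs each, so the longer one, `https://example.com/docs/*`, applies.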
Environmental Justice (EJ) Enhancement
- Expanded from 92 manual datasets to 1063 ML-classified NASA CMR records
- New modular processing suite for metadata generation
- Enhanced API with multiple data sources:
  - Spreadsheet (original manual classifications)
  - ML Production
  - ML Testing
  - Combined (ML production with spreadsheet overrides)
- Custom processing suite for CMR metadata extraction
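The Combined data source listed above (ML production with spreadsheet overrides) amounts to a keyed merge in which the manual row wins. The sketch below is only an illustration of that idea; the `dataset` and `indicator` field names and the `combine_sources` helper are hypothetical and do not reflect the actual EJ API schema.

```python
def combine_sources(
    ml_production: list[dict],
    spreadsheet: list[dict],
    key: str = "dataset",
) -> list[dict]:
    """Merge two record lists; a spreadsheet row replaces the ML row with the same key."""
    combined = {row[key]: row for row in ml_production}
    # Manual classifications take precedence over ML output.
    combined.update({row[key]: row for row in spreadsheet})
    return list(combined.values())


ml_rows = [
    {"dataset": "A", "indicator": "Climate Change"},
    {"dataset": "B", "indicator": "Food Availability"},
]
manual_rows = [
    {"dataset": "B", "indicator": "Extreme Heat"},
    {"dataset": "C", "indicator": "Disasters"},
]
combined_rows = combine_sources(ml_rows, manual_rows)
```

Dataset B keeps its manual classification, while A (ML only) and C (spreadsheet only) pass through unchanged.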
Infrastructure Updates
- Streamlined database backup and restore
- Optimized Docker builds
- Fixed LetsEncrypt staging issues
- Modified Traefik timeouts for long-running jobs
- Updated Sinequa worker configuration:
  - Reduced worker count to 3 for neural workload optimization
  - Added neural indexing to all webcrawlers
  - Removed deprecated version mappings
API Enhancements
- New endpoints for curated and delta URLs:
  - GET /curated-urls-api/<str:config_folder>/
  - GET /delta-urls-api/<str:config_folder>/
- Backwards compatibility through remapped CandidateUrl endpoint
- Updated Environmental Justice API with new data source parameter
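As a small usage sketch for the two URL endpoints above, the helper below builds the request URL for a given collection, treating `config_folder` as a single string path segment. `BASE_URL` is a placeholder host, not the real deployment, and `collection_urls_endpoint` is a hypothetical helper; the response format is not described in these notes, so no request is made here.

```python
from urllib.parse import quote, urljoin

# Placeholder host for illustration only.
BASE_URL = "https://cosmos.example.org/"


def collection_urls_endpoint(config_folder: str, kind: str = "curated") -> str:
    """Return the URL of the curated or delta URL endpoint for one collection."""
    if kind not in {"curated", "delta"}:
        raise ValueError(f"kind must be 'curated' or 'delta', got {kind!r}")
    # Percent-encode the folder name so it stays a single path segment.
    return urljoin(BASE_URL, f"{kind}-urls-api/{quote(config_folder, safe='')}/")


curated_endpoint = collection_urls_endpoint("my_collection")
delta_endpoint = collection_urls_endpoint("my_collection", kind="delta")
```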
Technical Improvements
Two-Column System
- New architecture to support dual ML/manual classifications
- Seamless integration with models, serializers, and APIs
- Prioritization system for manual overrides
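The precedence rule of the two-column system can be shown with a minimal sketch: each logical field is stored twice, and reads return the manual value whenever a curator has set one. This is not the COSMOS implementation; `ClassifiedUrl`, `division_ml`, and `division_manual` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClassifiedUrl:
    url: str
    division_ml: Optional[str] = None      # written by the classifier
    division_manual: Optional[str] = None  # written by a curator

    @property
    def division(self) -> Optional[str]:
        """Return the manual value if a curator set one, else the ML value."""
        if self.division_manual is not None:
            return self.division_manual
        return self.division_ml


record = ClassifiedUrl("https://example.com/page", division_ml="Astrophysics")
ml_only = record.division             # no override yet, so the ML value is used
record.division_manual = "Heliophysics"
record.division_ml = "Earth Science"  # a later ML re-run does not displace the override
overridden = record.division
```

Keeping both columns means a model re-run can refresh the ML value freely without erasing curator decisions.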
Testing
Added 213 new tests across multiple areas:
- URL APIs and processing (19 tests)
- Delta and pattern management (31 tests)
- Environmental Justice API (7 tests)
- Environmental Justice Mappings and Thresholding (58 tests)
- Workflow and status triggers (10 tests)
- Migration and promotion processes (31 tests)
- Field modifications and TDAMM tags (25 tests)
- Additional system functionality (30 tests)
Documentation
Added comprehensive documentation across 15 READMEs covering:
- Pattern system fundamentals and examples
- Reindexing statuses and triggers
- Model lifecycles and testing procedures
- URL inclusion/exclusion logic
- Environmental Justice classifier and API
- ML column functionality
- SQL dump restoration
Bug Fixes
- Fixed non-functional includes
- Resolved pagination issues for patterns (previously limited to 50)
- Eliminated ability to create duplicate URLs and patterns
- Corrected faulty unapply logic for modification patterns
- Fixed unrepeatable logic for overlapping patterns
- Allowed long-running jobs to complete without timeouts
UI Updates
- Renamed application from "SDE Indexing Helper" to "COSMOS"
- Refactored collection list code for easier column management
- Enhanced URL list page with new status and deletion tracking
- Improved navigation through delta URL count integration
Administrative Changes
- Added new admin panels for enhanced system management
- Updated installation requirements
- Enhanced database backup and restore functionality
What's Changed (PR Log)
- remove force reindexing from templates by @CarsonDavis in #1018
- point tree root to name by @CarsonDavis in #1027
- Change LRM dev configurations by @bishwaspraveen in #1034
- get URLs from scrapers folder for LRM servers by @bishwaspraveen in #1037
- change EnableNeuralIndexing to true in indexing template by @CarsonDavis in #1070
- Retrieve Full-Texts from Sinequa Dev Servers by @saifrk in #1077
- add per indicator thrsholding and new dump by @CarsonDavis in #1073
- 1051 backend model changes on cosmos to hold new incoming urls frontend by @dhanur-sharma in #1090
- 1051 backend model changes on cosmos to hold new incoming urls by @bishwaspraveen in #1069
- 1105 improve pattern application and exclusion management by @CarsonDavis in #1109
- remove destination_server and add datasource by @CarsonDavis in #1108
- Update cmr mappings by @CarsonDavis in #1102
- 1115 improve title processing and tests by @CarsonDavis in #1118
- Affected Delta URLs header added by @dhanur-sharma in #1117
- 3034 cosmos api test cases by @dhanur-sharma in #1114
- Pagination on the Sinequa sql.engine Api by @saifrk in #1104
- Updated page title to URLs by @dhanur-sharma in #1120
- Refresh page on workflow status change by @dhanur-sharma in #1124
- Refactor Two Column to work with Delta Urls by @Kirandawadi in #1103
- add initial reindexing statuses by @CarsonDavis in #1125
- View deleted URLs under Delta URLs page by @dhanur-sharma in #1121
- 1126 managepy command for database backups by @CarsonDavis in #1127
- 3055 optmize the retrieval of url counts on admin page by @bishwaspraveen in #1131
- Updated database restore command by @dhanur-sharma in #1130
- 1133 refactor indexing statuses logic by @CarsonDavis in #1134
- Updated dockerignore and gitignore by @dhanur-sharma in #1135
- 1139 resolve id conflict when promoting by @CarsonDavis in #1140
- Updated title pane to Delta URLs by @dhanur-sharma in #1141
- refactor readme for unapply logic and refactor unapply to account for overlapping patterns by @CarsonDavis in #1146
- Filters fixed by @dhanur-sharma in #1145
- add new field to reindexing statuses by @CarsonDavis in #1148
- Conditional anchor updated for 0 Delta URLs by @dhanur-sharma in #1161
- fixed paging on excludes and includes tabs by @bishwaspraveen in #1163
- 1150 status button color matches by @dhanur-sharma in #1162
- Add documentation for PairedFieldDescriptor implementation by @Kirandawadi in #1160
New Contributors
- @saifrk made their first contribution in #1077
- @dhanur-sharma made their first contribution in #1090
Full Changelog: 3f85f26...8df561a
v2.0.1
What's Changed
- Fix fake flake8 issues by @code-geek in #976
- Add LRM_QA_{USER, PASSWORD} variable to .django by @Kirandawadi in #985
- Make coding syntax consistent by @Kirandawadi in #990
- [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci in #979
- Add CONTRIBUTING.md file by @Kirandawadi in #996
- Add SQLDumpRestoration.md file by @Kirandawadi in #994
New Contributors
- @Kirandawadi made their first contribution in #985
- @pre-commit-ci made their first contribution in #979
Full Changelog: v2.0.0...v2.0.1
v2.0.0
What's Changed
- Feature Enhancements: Integrated several new features, including a "Push to GitHub" button for selected collections, a conversation history webapp, and a JSON indexing template. Enhanced the indexing process with dynamic plugin generation and updated URL indexing endpoints.
- Infrastructure and Configuration Updates: Improved project setup with updated configuration files and added mechanisms for automatic updates, such as using Celery to pull in URLs and updating collections based on production API. Switched Celery broker from Redis to SQS for better scalability.
- Bug Fixes and Stability Improvements: Addressed various bugs, including inference bug fixes, preventing tag duplication, and resolving CORS issues on the frontend. Reverted certain changes for better stability and fixed issues related to job creation and indexing.
- Codebase and API Updates: Introduced significant updates to the codebase, such as adding type hints, refreshing code libraries, and updating API endpoints to accommodate new features and feedback. Implemented functional tests using Selenium for enhanced reliability.
- Admin and User Interface Improvements: Enhanced the webapp user experience by refining the UI, including removing clutter, automating file creation at specific status changes, and aligning webapp status implementation with the current process. Added admin actions for better management and visibility.
New Contributors
- @RajashreeDahal4 made their first contribution in #379
- @anisbhsl made their first contribution in #361
- @bishwaspraveen made their first contribution in #409
- @Jmok19927 made their first contribution in #690
- @emshahh made their first contribution in #689
- @Kshaw362 made their first contribution in #691
Full Changelog: v1.1.0...v2.0.0
v1.1.0
What's Changed
- Add pytest config for vscode by @code-geek in #246
- Add code to pull in connector type by @code-geek in #252
- Deal with collections that dont have a sinequa configuration by @code-geek in #256
- Import metadata from Sinequa configs into collections on the webapp by @code-geek in #257
- Implement soft delete filtering on collection list by @code-geek in #262
- Check if pull request already exists and dont hit the create api if it does by @code-geek in #264
- Update collections fixture with latest data by @code-geek in #266
- Add django extensions to prod by @code-geek in #268
- Fix bugs in the GitHub pipeline by @code-geek in #270
- When trying to remove a title pattern by deleting it from the input box, it throws an error by @rajdangol0077 in #258
New Contributors
- @impact-github-bot made their first contribution in #239
Full Changelog: v1.0.0...v1.1.0
v1.0.0
What's Changed
- Exclude patterns by @code-geek in #25
- Feature update models by @CarsonDavis in #27
- Add machine name by @CarsonDavis in #31
- Add jupyter notebook to ingest from csv by @code-geek in #43
- Added minor features by @SauravUpadhyaya in #8
- Feature track req urls by @CarsonDavis in #62
- Feature sinequa scraper by @code-geek in #89
- Dropdown to change the status of a collection on the collection list page by @code-geek in #90
- Dropdown to change the user who is curating a collection from the collection list page by @code-geek in #91
- Dev to main by @code-geek in #93
- Remove broken fields from collection detail page by @code-geek in #95
- Bring in recently indexed collections and set the status as ready to clean by @code-geek in #101
- Turn on statesave for candidate urls table by @code-geek in #102
- Add an option to curation status by @code-geek in #104
- Refactor to Scrape by Indexing and Generate Jobs in Parallel by @CarsonDavis in #110
- API to Ingest Candidate URLs in bulk from the test server by @code-geek in #120
- API to ingest candidate urls -- improved by @code-geek in #121
- Add scripts to export the entire index by @code-geek in #123
- Add code to pull entire index from s3 by @code-geek in #126
- Remove sidebar by @code-geek in #128
- Enable stateSave and reduce the number of rows on the collection list page by @code-geek in #139
- change Curated to Visited by @code-geek in #140
- Add link to sinequa configuration on the collection detail page by @code-geek in #142
- Update code to export select collections and document processes by @code-geek in #148
- Add ability to navigate to a page with user input by @code-geek in #150
- Add link to github issue on each collection by @code-geek in #152
- Prevent moving to the top of the page when changing dropdowns by @code-geek in #156
- Avoid refreshing the page on dropdown change by @code-geek in #157
- Add a status called delete/combine collection by @rajdangol0077 in #154
- Add filter for status and allow export csv by @code-geek in #160
- Show Title Patterns in the admin by @code-geek in #164
- Add pattern type as a filter to TitlePattern admin by @code-geek in #165
- Candidate URLs page improvements by @code-geek in #167
- Change cursor to pointer when hovering on select by @code-geek in #173
- Improve DocumentTypePattern modal by @code-geek in #176
- Allow wildcards in match patterns by @code-geek in #96
- Add a searchdelay of 1000 ms by @code-geek in #185
- DataTables enhancements -- search panes and select by @code-geek in #186
- Add ranges to candidate URL search pane by @code-geek in #188
- Speed up pattern creation and add ability to deselect document type by @code-geek in #190
- Change traefik timeout to 5 minutes by @code-geek in #192
- Set curation status to "Ready to curate" when new URLs are available by @code-geek in #194
- Streamline loading candidate urls into the web app using the API by @code-geek in #203
- Add mechanism to pull in URLs from prod by @code-geek in #210
- Hide github issue link if there isn't one; Make the field editable by @rajdangol0077 in #211
- Create a models module since models.py is getting too big by @code-geek in #215
- Push generated xml to github from the web app by @code-geek in #204
- Generate XML files from patterns by @CarsonDavis in #162
- Add new status for Github_PR_Created by @code-geek in #219
- change match criteria to be surrounded by single quotes by @CarsonDavis in #231
- Make curation status update when changed by @code-geek in #235
Contributors
- @code-geek
- @CarsonDavis
- @rajdangol0077
Full Changelog: https://github.com/NASA-IMPACT/sde-indexing-helper/commits/v1.0.0