soda-curation is a professional Python package for automated data curation of scientific manuscripts using AI capabilities. It specializes in processing and structuring ZIP files containing manuscript data, extracting figure captions, and matching them with corresponding images and panels.
- Features
- Installation
- Configuration
- Prompts
- Usage
- Pipeline Steps
- Output Schema
- Testing
- Docker
- Quality Control (QC) Pipeline
- Contributing
- License
- Automated processing of scientific manuscript ZIP files
- AI-powered extraction and structuring of manuscript information
- Figure and panel detection using advanced object detection models
- Intelligent caption extraction and matching for figures and panels
- Support for OpenAI's GPT models
- Flexible configuration options for fine-tuning the curation process
- Debug mode for development and troubleshooting
- Integrated Quality Control (QC) pipeline for automated figure and data assessment
- Clone the repository:
git clone https://github.com/source-data/soda-curation.git
cd soda-curation
- Install the package using Poetry:
poetry install
Or, if you prefer to use pip:
pip install -e .
- Set up environment variables: Create environment-specific .env files.
Using environment variables is the recommended way to store sensitive information like API keys:
OPENAI_API_KEY=your_openai_key
ENVIRONMENT=test  # or dev or prod
The configuration system uses a flexible, hierarchical approach supporting different environments (dev, test, prod) with environment-specific settings. Configuration is managed through:
- YAML files for general settings (e.g., config.dev.yaml, config.qc.yaml)
- Environment variables for sensitive information
- Command-line arguments for runtime options
- Main pipeline config: Controls manuscript processing, AI model selection, and pipeline steps.
- QC config (config.qc.yaml): Controls all quality control tests, test metadata, and versioning. Example:
qc_version: "0.3.1"
qc_test_metadata:
panel:
plot_axis_units:
name: "Plot Axis Units"
description: "Checks whether plot axes have defined units for quantitative data."
permalink: "https://github.com/source-data/soda-mmQC/blob/main/.../plot-axis-units/prompts/prompt.3.txt"
error_bars_defined:
name: "Error Bars Defined"
description: "Checks whether error bars are defined in the figure caption."
prompt_version: 2
checklist_type: "fig-checklist"
# ... more tests ...
figure:
# Figure-level tests...
document:
# Document-level tests...
default:
openai:
model: "gpt-4o"
temperature: 0.1
# ...
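As a rough illustration of how these pieces fit together at load time (a minimal sketch, not the package's actual loader; the openai_api_key key and file naming are illustrative):
import os
import yaml

def load_config(path: str) -> dict:
    """Load an environment-specific YAML config and merge in secrets from the environment."""
    with open(path) as fh:
        config = yaml.safe_load(fh)
    # Sensitive values come from .env / environment variables, never from the YAML files.
    config["openai_api_key"] = os.environ["OPENAI_API_KEY"]
    return config

environment = os.environ.get("ENVIRONMENT", "dev")
config = load_config(f"config.{environment}.yaml")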
The application supports different environments through Docker:
# For CPU-only environments
docker build -t soda-curation-cpu . -f Dockerfile.cpu --target development
# Build and run development environment
docker-compose -f docker-compose.dev.yml build
docker-compose -f docker-compose.dev.yml run --rm soda /bin/bash
# Development with console access
docker compose -f docker-compose.dev.yml run --rm --entrypoint=/bin/bash soda
Inside the container:
poetry run python -m src.soda_curation.main \
--zip /app/data/archives/your-manuscript.zip \
--config /app/config.yaml \
--output /app/data/output/results.json
poetry run python -m src.soda_curation.main \
--zip /app/data/archives/EMM-2023-18636.zip \
--config /app/config.dev.yaml \
--output /app/data/output/EMM-2023-18636.json
# Inside the container
poetry run pytest tests/test_suite
# With coverage report
poetry run pytest tests/test_suite --cov=src --cov-report=html
The package includes a comprehensive benchmarking system for evaluating model performance across different tasks and configurations. It is configured through config.benchmark.yaml and runs using pytest.
# Run the benchmark tests
poetry run pytest tests/test_pipeline/run_benchmark.py
An example config.benchmark.yaml:
# Global settings
output_dir: "/app/data/benchmark/"
ground_truth_dir: "/app/data/ground_truth"
manuscript_dir: "/app/data/archives"
prompts_source: "/app/config.dev.yaml"
# Test selection
enabled_tests:
- extract_sections
- extract_individual_captions
- assign_panel_source
- extract_data_availability
# Model configurations to test
providers:
openai:
models:
- name: "gpt-4o"
temperatures: [0.0, 0.1, 0.5]
top_p: [0.1, 1.0]
# Test run configuration
test_runs:
n_runs: 1 # Number of times to run each configuration
manuscripts: "all" # Can be "all", a number, or specific IDs
- Test Selection: Choose which pipeline components to evaluate:
- Section extraction
- Individual caption extraction
- Panel source assignment
- Data availability extraction
- Model Configuration: Configure different models and parameters:
- Multiple providers (OpenAI, Anthropic)
- Various models per provider
- Temperature and top_p parameter combinations
- Multiple runs per configuration
- Output and Metrics:
- Results are saved in the specified output directory
- Generates CSV files with detailed metrics
- Saves prompts used for each test
- Creates comprehensive test reports
The benchmark system generates several output files:
- metrics.csv: Contains detailed performance metrics including:
  - Task-specific scores
  - Model parameters
  - Execution times
  - Input/output comparisons
- prompts.csv: Documents the prompts used for each task:
  - System prompts
  - User prompts
  - Task-specific configurations
- results.json: Detailed test results including:
  - Raw model outputs
  - Expected outputs
  - Scoring details
  - Error information
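To explore the benchmark output, the metrics file can be aggregated with pandas; the column names below are assumptions based on the description above, not a documented schema:
import pandas as pd

# Column names ("task", "model", "temperature", "score") are assumed for illustration.
df = pd.read_csv("data/benchmark/metrics.csv")
summary = (
    df.groupby(["task", "model", "temperature"])["score"]
      .agg(["mean", "std", "count"])
      .sort_values("mean", ascending=False)
)
print(summary)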
The soda-curation pipeline processes scientific manuscripts through the following detailed steps:
- Purpose: Extract and organize the manuscript's structure and components
- Process:
- Parses the ZIP file to identify manuscript components (XML, DOCX/PDF, figures, source data)
- Creates a structured representation of the manuscript's files
- Establishes relationships between figures and their associated files
- Extracts manuscript content from DOCX/PDF for further analysis
- Builds the initial ZipStructure object that will be enriched throughout the pipeline (see the sketch below)
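Conceptually, the ZipStructure object mirrors the fields that later appear in the output schema; a simplified, hypothetical sketch (not the actual class definitions):
from dataclasses import dataclass, field
from typing import List

@dataclass
class Panel:
    panel_label: str = ""
    panel_caption: str = ""
    sd_files: List[str] = field(default_factory=list)

@dataclass
class Figure:
    figure_label: str
    img_files: List[str] = field(default_factory=list)
    sd_files: List[str] = field(default_factory=list)
    figure_caption: str = ""
    panels: List[Panel] = field(default_factory=list)

@dataclass
class ZipStructure:
    manuscript_id: str
    xml: str = ""
    docx: str = ""
    pdf: str = ""
    appendix: List[str] = field(default_factory=list)
    figures: List[Figure] = field(default_factory=list)
    errors: List[str] = field(default_factory=list)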
- Purpose: Identify and extract critical manuscript sections
- Process:
- Uses AI to locate figure legend sections and data availability sections
- Extracts these sections verbatim to preserve all formatting and details
- Verifies extractions against the original document to prevent hallucinations
- Returns structured content for further processing
- Preserves HTML formatting from the original document
- Purpose: Parse figure captions into structured components
- Process:
- Divides full figure legends section into individual figure captions
- For each figure, extracts:
- Figure label (e.g., "Figure 1")
- Caption title (main descriptive heading)
- Complete caption text with panel descriptions
- Identifies panel labels (A, B, C, etc.) within each caption
- Ensures panel labels follow a monotonically increasing sequence
- Associates each panel with its specific description from the caption
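For example, a single figure entry produced by this step might look like this (values are illustrative):
{
  "figure_label": "Figure 1",
  "caption_title": "Loss of protein X impairs mitochondrial function.",
  "figure_caption": "(A) Representative micrographs of ... (B) Quantification of ...",
  "panels": [
    {"panel_label": "A", "panel_caption": "(A) Representative micrographs of ..."},
    {"panel_label": "B", "panel_caption": "(B) Quantification of ..."}
  ]
}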
- Purpose: Extract structured data source information
- Process:
- Analyzes the data availability section to identify database references
- Extracts database names, accession numbers, and URLs/DOIs
- Structures this information for linking to the appropriate figures/panels
- Creates standardized references to external data sources
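A single extracted data source entry then looks like this (values are illustrative; the real output follows the data_sources schema below):
{
  "database": "Gene Expression Omnibus (GEO)",
  "accession_number": "GSE123456",
  "url": "https://identifiers.org/geo:GSE123456"
}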
- Purpose: Match source data files to specific figure panels
- Process:
- Analyzes file names and patterns in source data files
- Maps each source data file to its corresponding panel(s)
- Uses panel indicators in filenames, data types, and logical groupings
- Identifies files that cannot be confidently assigned to specific panels
- Handles cases where files belong to multiple panels
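As a rough illustration of the filename-pattern part of this step (the actual assignment is AI-assisted and considers more context):
import re
from typing import List

def panels_for_file(filename: str, panel_labels: List[str]) -> List[str]:
    """Guess which panels a source data file belongs to from its name."""
    # e.g. "Figure 1B.xlsx" matches panel "B"; files with no match end up in unassigned_sd_files.
    stem = filename.rsplit(".", 1)[0]
    return [label for label in panel_labels
            if re.search(rf"(?<![A-Za-z]){re.escape(label)}(?![A-Za-z])", stem)]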
- Purpose: Detect individual panels within figures and match with captions
- Process:
- Panel Detection:
- Uses a trained YOLOv10 model to detect panel regions within figure images
- Identifies bounding boxes for each panel with confidence scores
- Handles complex multi-panel figures with varying layouts
- AI-Powered Caption Matching:
- For each detected panel region, extracts the panel image
- Uses AI vision capabilities to analyze panel contents
- Matches visual content with appropriate panel descriptions from the caption
- Resolves conflicts when multiple detections map to the same panel label
- Assigns sequential labels (A, B, C...) to any additional detected panels
- Preserves original caption information while adding visual context
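A condensed sketch of the detection half of this step using the ultralytics API; the weights path and confidence threshold are assumptions, and the resulting crops would then be passed to the vision model for caption matching:
from PIL import Image
from ultralytics import YOLO

model = YOLO("models/panel_detection_yolov10.pt")  # hypothetical weights path

def detect_panels(figure_path: str, conf: float = 0.25):
    """Return panel crops with relative bounding boxes and confidence scores."""
    image = Image.open(figure_path).convert("RGB")
    width, height = image.size
    result = model.predict(image, conf=conf)[0]
    panels = []
    for box in result.boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        panels.append({
            "panel_bbox": [x1 / width, y1 / height, x2 / width, y2 / height],
            "confidence": float(box.conf[0]),
            "crop": image.crop((int(x1), int(y1), int(x2), int(y2))),
        })
    return panels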
- Purpose: Compile all processed information and verify quality
- Process:
- Assembles the complete manuscript structure with all enriched information
- Calculates hallucination scores to verify content authenticity
- Cleans up source data file references
- Computes token usage and cost metrics for AI operations
- Generates structured JSON output according to the defined schema
Throughout these steps, the pipeline leverages AI capabilities to enhance the accuracy of caption extraction and panel matching. The process is configurable through the config.yaml
file, allowing for adjustments in AI models, detection parameters, and debug options.
In debug mode, the pipeline can be configured to process only the first figure, saving time during development and testing. Debug images and additional logs are saved to help with troubleshooting and refinement of the curation process.
The pipeline now uses an integrated verification approach to ensure text extractions are verbatim rather than hallucinated or modified by the AI.
Instead of post-processing comparison with fuzzy matching, the system now:
- Uses AI Agent Tools: Specialized verification tools check if extractions are verbatim during the AI processing, not afterward.
- Multi-Attempt Verification: If verification fails, the AI tries up to 5 times to produce a verbatim extraction.
- Explicit Verbatim Flagging: Each extraction includes an is_verbatim field indicating verification success.
Three main verification tools have been implemented:
- verify_caption_extraction: Ensures figure captions are extracted verbatim from the manuscript text.
- verify_panel_sequence: Confirms panel labels follow a complete sequence without gaps (A, B, C... not A, C, D...).
- General verification tool: For sections like figure legends and data availability.
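In spirit, these checks boil down to a whitespace-tolerant containment test and a gapless-sequence test; a minimal sketch (the actual tools are registered with the AI agent and are richer than this):
import re
from typing import List

def is_verbatim(extracted: str, source: str) -> bool:
    """Check that an extraction appears word-for-word in the source text."""
    def normalize(text: str) -> str:
        return re.sub(r"\s+", " ", text).strip().lower()
    return normalize(extracted) in normalize(source)

def panel_sequence_is_complete(panel_labels: List[str]) -> bool:
    """Check that labels form a gapless sequence A, B, C... (A, C, D fails)."""
    expected = [chr(ord("A") + i) for i in range(len(panel_labels))]
    return [label.upper() for label in panel_labels] == expected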
{
"manuscript_id": "string",
"xml": "string",
"docx": "string",
"pdf": "string",
"appendix": ["string"],
"figures": [{
"figure_label": "string",
"img_files": ["string"],
"sd_files": ["string"],
"panels": [{
"panel_label": "string",
"panel_caption": "string",
"panel_bbox": [number, number, number, number],
"confidence": number,
"ai_response": "string",
"sd_files": ["string"],
"hallucination_score": number
}],
"unassigned_sd_files": ["string"],
"duplicated_panels": ["object"],
"ai_response_panel_source_assign": "string",
"hallucination_score": number,
"figure_caption": "string",
"caption_title": "string"
}],
"ai_config": {
"provider": "string",
"model": "string",
"temperature": number,
"top_p": number,
"max_tokens": number
},
"data_availability": {
"section_text": "string",
"data_sources": [
{
"database": "string",
"accession_number": "string",
"url": "string"
}
]
},
"errors": ["string"],
"ai_response_locate_captions": "string",
"ai_response_extract_individual_captions": "string",
"non_associated_sd_files": ["string"],
"locate_captions_hallucination_score": number,
"locate_data_section_hallucination_score": number,
"ai_provider": "string",
"cost": {
"extract_sections": {
"prompt_tokens": number,
"completion_tokens": number,
"total_tokens": number,
"cost": number
},
"extract_individual_captions": {
"prompt_tokens": number,
"completion_tokens": number,
"total_tokens": number,
"cost": number
},
"assign_panel_source": {
"prompt_tokens": number,
"completion_tokens": number,
"total_tokens": number,
"cost": number
},
"match_caption_panel": {
"prompt_tokens": number,
"completion_tokens": number,
"total_tokens": number,
"cost": number
},
"extract_data_sources": {
"prompt_tokens": number,
"completion_tokens": number,
"total_tokens": number,
"cost": number
},
"total": {
"prompt_tokens": number,
"completion_tokens": number,
"total_tokens": number,
"cost": number
}
}
}
- manuscript_id: Unique identifier for the manuscript
- xml: Path to the XML file in the ZIP archive
- docx: Path to the DOCX file in the ZIP archive
- pdf: Path to the PDF file in the ZIP archive
- appendix: List of paths to appendix files
- figures: Array of figure objects, each containing:
  - figure_label: Label of the figure (e.g., "Figure 1")
  - img_files: List of paths to image files for this figure
  - sd_files: List of paths to source data files for this figure
  - figure_caption: Full caption of the figure
  - caption_title: Title of the figure caption
  - hallucination_score: Score between 0-1 indicating possibility of hallucination (0 = verified content, 1 = likely hallucinated)
  - panels: Array of panel objects, each containing:
    - panel_label: Label of the panel (e.g., "A", "B", "C")
    - panel_caption: Caption specific to this panel
    - panel_bbox: Bounding box coordinates of the panel [x1, y1, x2, y2] in relative format
    - confidence: Confidence score of the panel detection
    - ai_response: Raw AI response for this panel
    - sd_files: List of source data files specific to this panel
    - hallucination_score: Score between 0-1 indicating possibility of hallucination (0 = verified content, 1 = likely hallucinated)
  - unassigned_sd_files: Source data files not assigned to specific panels
  - duplicated_panels: List of panels that appear to be duplicates
  - ai_response_panel_source_assign: AI response for panel source assignment
- errors: List of error messages encountered during processing
- ai_response_locate_captions: Raw AI response for locating figure captions
- ai_response_extract_individual_captions: Raw AI response for extracting individual captions
- non_associated_sd_files: List of source data files not associated with any specific figure or panel
- locate_captions_hallucination_score: Score between 0-1 indicating possibility of hallucination in the captions extraction
- locate_data_section_hallucination_score: Score between 0-1 indicating possibility of hallucination in the data section extraction
- ai_config: Configuration details of the AI processing
- data_availability: Information about data availability
  - section_text: Text describing the data availability section
  - data_sources: List of data sources with database, accession number, and URL
    - database: Name of the database
    - accession_number: Accession number or identifier
    - url: URL to the data source (can also be a DOI)
- ai_provider: Identifier for the AI provider used
- cost: Detailed breakdown of token usage and costs for each processing step
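As a small usage example, the per-step cost breakdown can be summarized straight from the output file:
import json

with open("data/output/results.json") as fh:
    result = json.load(fh)

# Print token usage and cost for every pipeline step, including the "total" entry.
for step, usage in result["cost"].items():
    print(f"{step:35s} {usage['total_tokens']:>10} tokens  ${usage['cost']:.4f}")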
To format and lint your code, run the following command:
# Build the docker-compose image
docker-compose build format
# Run the formatting and linting checks
docker-compose run --rm format
The QC pipeline provides automated, configurable, and extensible quality assessment of scientific figures and data presentation. It can be run independently or as part of the main curation workflow.
- Schema-based analyzer detection: Automatically determines test types (panel/figure/document) by analyzing Pydantic model structures from schemas
- Intelligent fallback logic: Uses schema analysis first, then config-based detection, then naming conventions
- Config-driven test metadata and versioning: All test names, descriptions, and permalinks are defined in config.qc.yaml
- Flexible prompt file naming: Supports arbitrary prompt filenames (e.g., prompt.3.txt, custom_prompt.txt) instead of fixed naming
- Benchmark.json metadata integration: Enriches test metadata with descriptions and examples from the mmQC repository
- Word document processing: ManuscriptQCAnalyzer can process actual Word documents for document-level analysis
- Unified output format: Uses the qc_checks and check_name structure with enhanced metadata
- Hierarchical test organization: Tests can be defined at panel, figure, or document level
- Generic test implementation: New tests can be added without writing custom code
poetry run python -m src.soda_curation.qc.main \
--config config.qc.yaml \
--figure-data data/output/your_figure_data.json \
--zip-structure data/output/your_zip_structure.pickle \
--output data/output/qc_results.json
poetry run python -m src.soda_curation.qc.main \
--config config.qc.yaml \
--figure-data data/output/EMM-2023-18636_figure_data.json \
--zip-structure data/output/EMM-2023-18636_zip_structure.pickle \
--output data/output/qc_results.json
Major configuration improvements in config.qc.yaml:
- Removed example_class dependency: The system now automatically detects analyzer types using schema analysis:
# OLD (no longer needed):
qc_check_metadata:
  panel:
    plot_axis_units:
      name: "Plot Axis Units"
      prompt_file: "prompt.3.txt"
      checklist_type: "fig-checklist"
      example_class: "panel"  # ← REMOVED

# NEW (automatic detection):
qc_check_metadata:
  panel:
    plot_axis_units:
      name: "Plot Axis Units"
      prompt_file: "prompt.3.txt"  # ← Clean, flexible naming
      checklist_type: "fig-checklist"
- Schema-based type detection: Analyzer types are automatically determined by analyzing Pydantic models (see the sketch after this list):
  - Panel-level: Schemas with lists containing panel_label fields
  - Figure-level: Schemas with object structures (no lists)
  - Document-level: Schemas with document-related fields (sections, abstract, etc.)
- Flexible prompt naming: Support for arbitrary prompt filenames:
qc_check_metadata:
  panel:
    stat_test:
      name: "Statistical Test Mentioned"
      prompt_file: "prompt.4.txt"  # New flexible file naming
    plot_axis_units:
      name: "Plot Axis Units"
      prompt_file: "prompt.3.txt"  # Still supported for backward compatibility
    stat_significance_level:
      name: "Statistical Significance Level Defined"
      prompt_file: "prompt.2.txt"  # Example of custom filename
  document:
    section_order:
      name: "Manuscript Structure Check"
      prompt_file: "prompt.2.txt"
      checklist_type: "doc-checklist"
- Enhanced metadata integration: Automatic enrichment from benchmark.json files in the mmQC repository
- Version bump: Updated qc_version to "2.3.1" to reflect major improvements and bug fixes
Migration Guide: If upgrading from v2.2.x or earlier, simply remove all example_class fields from your config.qc.yaml. The system will automatically detect the correct analyzer types using the new schema-based detection.
Here's the full structure of a modern config.qc.yaml:
qc_version: "2.3.1"
qc_check_metadata:
panel:
plot_axis_units:
name: "Plot Axis Units"
prompt_file: "prompt.3.txt"
checklist_type: "fig-checklist"
stat_test:
name: "Statistical Test Mentioned"
prompt_file: "prompt.4.txt"
checklist_type: "fig-checklist"
individual_data_points:
name: "Individual Data Points Displayed"
prompt_file: "prompt.2.txt"
checklist_type: "fig-checklist"
figure:
# Figure-level tests would go here
# (automatically detected from schema structure)
document:
section_order:
name: "Manuscript Structure Check"
prompt_file: "prompt.2.txt"
checklist_type: "doc-checklist"
# OpenAI configuration
default: &default
openai:
model: "gpt-4o"
temperature: 0.1
top_p: 1.0
max_tokens: 2048
frequency_penalty: 0.0
presence_penalty: 0.0
json_mode: true
The debug visualizer helps you inspect what the AI is analyzing by extracting figure images and captions for visual inspection.
# Extract all figures from a dataset
poetry run python -m src.soda_curation.debug_visualizer \
data/output/EMM-2023-18636_figure_data.json \
--output-dir data/debug_images \
--prefix EMM-2023-18636
# Just analyze image properties without extracting
poetry run python -m src.soda_curation.debug_visualizer \
data/output/EMM-2023-18636_figure_data.json \
--analyze
Compare QC results against actual figure content to identify potential issues:
# Run QC analysis to identify issues
poetry run python -m src.soda_curation.qc_analysis \
data/output/qc_results.json \
data/output/EMM-2023-18636_figure_data.json \
--report data/debug_images/qc_analysis_report.html
Debug outputs include:
- Individual PNG files - Exact images the AI analyzes
- Caption text files - Complete captions for each figure
- HTML summary - Browser-viewable overview of all figures
- QC analysis report - Detailed comparison of QC results vs actual content
This helps identify issues like:
- Missing statistical notation detection (mean ± SD)
- Incorrect sample size parsing (n=3)
- Caption parsing failures
- Panel identification problems
{
"qc_version": "2.3.1",
"qc_check_metadata": {
"plot_axis_units": {
"name": "Plot Axis Units",
"prompt_file": "prompt.3.txt",
"checklist_type": "fig-checklist",
"description": "Automatically extracted from benchmark.json",
"examples": ["Sample analysis examples..."],
"permalink": "https://github.com/source-data/soda-mmQC/tree/dev/prompt.3.txt"
},
"stat_test": {
"name": "Statistical Test Mentioned",
"prompt_file": "prompt.4.txt",
"checklist_type": "fig-checklist",
"description": "Checks if statistical tests are mentioned",
"permalink": "https://github.com/source-data/soda-mmQC/tree/dev/prompt.4.txt"
}
},
"figures": [
{
"figure_label": "Figure 1",
"panels": [
{
"panel_label": "A",
"qc_checks": [
{
"check_name": "plot_axis_units",
"passed": true,
"details": "Units clearly defined on both axes"
}
]
}
]
}
]
}
To add a new test to the QC pipeline, follow these steps:
- Define test metadata in the config:
  - Open config.qc.yaml.
  - Under qc_test_metadata at the appropriate level (panel, figure, document), add a new entry for your test. Include at least name, description, and permalink fields. Example:
qc_test_metadata:
  panel:
    my_new_test:
      name: "My New Test"
      description: "Checks for a new quality control criterion."
      permalink: "https://github.com/source-data/soda-mmQC/blob/main/.../my-new-test/prompts/prompt.txt"
- Configure the test in the pipeline section:
  - Still in config.qc.yaml, add any specific settings under default if needed:
default:
  openai:
    model: "gpt-4o"
    temperature: 0.1
    # ...other settings...
- Run the QC pipeline:
- Execute the pipeline as usual. Your new test will be automatically detected and run.
- The system will generate appropriate test models and integrate results into the output.
- (Optional) Add test documentation:
  - Ensure the permalink in your metadata points to documentation or a prompt for your test.
- (Optional) Add custom analyzer:
- If your test requires specific logic beyond what the generic analyzers provide, create a custom analyzer class that extends the appropriate base class.
Tip: No custom code is required for most tests - the system will automatically generate test implementations based on the configuration.
qc_version: "2.1.0"
qc_test_metadata:
panel:
plot_axis_units:
name: "Plot Axis Units"
description: "Checks whether plot axes have defined units for quantitative data."
permalink: "https://github.com/source-data/soda-mmQC/blob/main/.../plot-axis-units/prompts/prompt.3.txt"
error_bars_defined:
name: "Error Bars Defined"
description: "Checks whether error bars are defined in the figure caption."
prompt_version: 2
checklist_type: "fig-checklist"
# ... more tests ...
Contributions to soda-curation are welcome! Here are some ways you can contribute:
- Report bugs or suggest features by opening an issue
- Improve documentation
- Submit pull requests with bug fixes or new features
Please ensure that your code adheres to the existing style and passes all tests before submitting a pull request.
- Fork the repository and clone your fork
- Install development dependencies:
poetry install --with dev
- Activate the virtual environment:
poetry shell
- Make your changes and add tests for new functionality
- Run tests to ensure everything is working:
./run_tests.sh
- Submit a pull request with a clear description of your changes
This project is licensed under the MIT License. See the LICENSE file for details.
For any questions or issues, please open an issue on the GitHub repository. We appreciate your interest and contributions to the soda-curation project!
- Bug Fix: Fixed CI test failures in test_prompt_registry.py due to incorrect attribute references
- Test Fix: Updated tests to use prompt_file instead of the non-existent prompt_number attribute in PromptMetadata
- Docker Fix: Ensured the Docker environment builds properly with ultralytics/YOLOv10 dependencies
- Quality Assurance: All 245 tests now passing in Docker environment
- CI/CD: Resolved build failures that were blocking continuous integration
- Major QC Pipeline Enhancement: Implemented schema-based analyzer detection
- Schema-Based Type Detection: Automatically determines panel/figure/document test types by analyzing Pydantic model structures
- Intelligent Analyzer Selection: List schemas with panel_label → panel-level, object schemas → figure-level, document fields → document-level
- Flexible Prompt File Naming: Support for arbitrary prompt filenames instead of the fixed prompt.1.txt pattern
- Enhanced Benchmark Integration: Rich metadata from benchmark.json files with automatic description enrichment
- Word Document Processing: ManuscriptQCAnalyzer now processes actual Word documents (.docx) for manuscript analysis
- Output Format Modernization: Updated to the qc_checks/check_name structure, removing the deprecated qc_tests/test_name fields
- Robust Fallback Logic: Schema detection → config-based detection → naming convention fallbacks
- Complete Test Coverage: 26/26 QC tests passing with comprehensive validation
- Removed example_class Dependency: No longer requires manual example_class configuration
- Complete refactoring of the QC pipeline with abstract base classes and factory pattern
- Added support for hierarchical test organization (panel, figure, document levels)
- Implemented generic test analyzers for different test levels
- Removed individual test modules in favor of dynamic test generation
- Enhanced error handling and robustness for test execution
- Improved metadata handling with hierarchical config structure
- Simplified output format by removing redundant fields
- Added fallback mechanisms for missing schemas and prompts
- QC pipeline now sources test metadata and version from config file
- Output includes qc_test_metadata and qc_version fields for all runs
- Permalinks for each QC test are included in the config and output
- Improved handling of test status (passed: null when not needed)
- Patch-level version bump for both soda-curation and QC pipeline
- Added Quality Control (QC) module for automated manuscript assessment
- Implemented statistical test reporting analysis for figures
- Dynamic loading of QC test modules from configuration
- Automatic generation of QC data during main pipeline execution
- 90% test coverage achieved across the codebase
- Case insensitive panel caption matching added
- Normalization of database links
- Permanent links of identifiers.org added
- Changed the EPS-to-thumbnail conversion logic so results match the UI
- Semideterministic individual caption extraction
- Verbatim check tool for agentic AI added to ensure verbatim caption extractions
- Remove hallucination score from panels
- Remove original source data files from figure level source data
- Replaced fuzzy-matching hallucination detection with AI agent verification tools
- Added tools for verbatim extraction verification of figure captions, sections, and panel sequences
- Enhanced panel detection to identify all panels in figures regardless of caption mentions
- Improved panel labeling to ensure sequential labels (A, B, C...) without gaps
- Modified test normalization and fixed some errors in benchmarking
- Corrected two ground-truth entries that had wrong values for extracting all the captions
- Modified the readme
- Added more robust normalization for the detection of possible hallucinated text
- Test coverage added
- Current test coverage is 92%
- Reformatting benchmark code into a package for better readability
- Added panel-caption matching to the benchmark
- Improved the handling of .eps and .tif files with ImageMagick and opencv
  - In the future, we could use tifffile, but it requires an upgrade to python3.10
- Figures with no panels or a single panel now return a single panel object
- Ground truth modified to include HTML and removed manuscript id from internal files
- Updated README.md
- Addition of hallucination scores to the output of the pipeline
- Ensure no panel duplication
- Generating output captions keeping the HTML text from the docx file
- No panels allowed for figures with single panels
- Added the panel label position to the panel matching prompt to increase performance
- Removal of manuscript ID from the source data file outputs
- Correction of non-standard encoding in file names
- Major changes
- Changes in the configuration and environment definition
- Pipeline configurable at every single step, allowing for total flexibility in AI model and parameter selection
- Extraction of data availability and figure legends sections into a single step
- Fusion of match panel caption and object detection into a single step
- Minor changes:
- Support for large images
- Support for .ai image files
- Removal of hallucinated files from the list of sd_files in output
- Ignoring Windows cache files during file assignment
- Updated output schema documentation to match actual output structure
- Improved panel source data assignment with full path preservation
- Enhanced error handling in panel caption matching
- Updated AI configuration handling
- Changing from the AI assistant API to the Chat API in OpenAI
- Supporting test, dev and prod environments
- Addition of tests and CI/CD pipeline
- Allow for storage of evaluation and model performance
- Prompts defined in the configuration file, now keeping configuration separately for each pipeline step
This tag marks the stable version of the soda-curation package, which extracts the following information from papers using OpenAI:
- XML manifest and structure
- Figure legends
- Figure panels
- Associate each figure panel with the corresponding caption text
- Associate source data at a panel level
- Extraction of the data availability section
- Includes model benchmarking on ten annotated ground truth manuscripts
- Obsolete tests removed
- Addition of benchmarking capabilities
- Adding manuscripts as string context to the AI models instead of DOCX or PDF files to improve behavior
- Ground truth data added
- Initial release
- Support for OpenAI and Anthropic AI providers
- Implemented figure and panel detection
- Added caption extraction and matching functionality