Skip to content

Upload feature#80

Merged
medihack merged 48 commits intomainfrom
upload_dev
Oct 30, 2025
Merged

Upload feature#80
medihack merged 48 commits intomainfrom
upload_dev

Conversation

@lmdulz
Copy link
Collaborator

@lmdulz lmdulz commented Aug 29, 2024

Closes #19
Closes #27

@change="toggleUploadButtonVisibility()"
accept=".dcm,.DCM"
webkitdirectory
mozdirectory>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which browsers are supported? Do we need a hint somewhere what the browser support is?

seed
);

return anon;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

analog zu genereller ADIT anonymisierung nach anonymisierung PatID und PatNamen mit Pseudonym ersetzen

@medihack
Copy link
Member

@codex

@chatgpt-codex-connector
Copy link

Summary

Testing

  • ⚠️ Not run (not requested)

View task →

@claude
Copy link

claude bot commented Oct 29, 2025

Code Review - Upload Feature (PR #80)

Thank you for this substantial contribution implementing the upload feature! This is a significant addition with ~26k lines added. Here's my detailed review:


Strengths

  1. Security Implementation

    • Proper permission checks with PermissionRequiredMixin and can_upload_data permission
    • CSRF token validation in API requests
    • User authentication requirements enforced
    • Client-side anonymization before upload (good privacy practice)
    • Permission verification for destination node access
  2. Code Quality

    • Follows Google Python Style Guide as per CONTRIBUTING.md
    • Good separation of concerns (models, views, forms, URLs)
    • Type hints used appropriately
    • Clean form validation with custom validators
  3. Testing

    • Includes both unit tests and acceptance tests
    • Permission tests verify security boundaries
    • Tests cover key workflows (clear files, stop upload)
  4. User Experience

    • Drag-and-drop support for files
    • Progress tracking during upload
    • Upload cancellation capability
    • HTMX integration for smooth UX

🔴 Critical Issues

1. Security: Anonymization Seed Exposure (adit/upload/views.py:50)

context["anon_seed"] = settings.ANONYMIZATION_SEED

Problem: The anonymization seed is passed directly to the client-side JavaScript. If this seed is meant to be secret for consistent anonymization across the system, exposing it to clients is a security vulnerability. Any user could extract this seed and potentially re-identify anonymized data.

Recommendation:

  • If the seed must be shared, document WHY it's acceptable
  • Consider per-user or per-session seeds instead
  • Alternatively, perform anonymization server-side only

2. Error Handling: Silent Failures (adit/upload/views.py:135-137)

except Exception:
    status = 500
    message = "Upload failed"

Problem: Bare except Exception catches ALL exceptions without logging. This makes debugging production issues nearly impossible.

Recommendation:

except Exception as e:
    logger.exception("Upload failed for user %s to node %s", request.user.id, node_id)
    status = 500
    message = "Upload failed"

3. Data Validation: Missing DICOM Validation (adit/upload/views.py:124)

The uploaded dataset is read without validation:

dataset = read_dataset(dataset_bytes)

Problem: No validation that the uploaded file is actually valid DICOM data. Malicious or corrupted files could cause issues.

Recommendation: Add try-except around DICOM parsing and validate structure.


⚠️ High Priority Issues

4. JavaScript: Sequential Uploads Block UI (upload.js:152-198)

for (const set of datasets) {
  // ... anonymize and upload ...
  status = await uploadData({...});
}

Problem: Uploads happen sequentially in a for-loop, blocking the browser for potentially long periods. For large studies (100+ images), this creates poor UX.

Recommendation: Implement batch uploads with concurrency:

const uploadPromises = datasets.map(async (set) => {
  // anonymize and upload
});
await Promise.allSettled(uploadPromises.slice(0, 5)); // 5 concurrent uploads

5. Memory Management: All Files Loaded Into Memory (upload.js:127-130)

for (const fileEntry of files) {
  await this.fileHandler(fileEntry, datasets);
}

Problem: ALL files are loaded into memory as ArrayBuffers before processing begins. For a 500-file study, this could exceed browser memory limits.

Recommendation: Stream-process files instead of loading all at once.

6. Data Integrity: isValidSeries Logic Issue (upload.js:316-351)

return (
  patientIDs.size == 1 &&
  patientNames.size == 1 &&
  patientBirthdates.size == 1
);

Problem: This validation ONLY checks if all files have the SAME patient info. It doesn't verify:

  • If any required DICOM tags are present
  • If the study belongs to the patient
  • Edge case: If ALL files are missing patient info, the sets would be empty (size 0) and validation would pass!

Recommendation: Add explicit checks:

// Ensure at least one valid patient was found
if (patientIDs.size === 0) {
  return false;
}
return patientIDs.size === 1 && patientNames.size === 1 && patientBirthdates.size === 1;

7. Race Condition: uploadResultText (upload.js:193, 206, 211)

Multiple code paths set uploadResultText without proper state management. If network responses arrive out of order, the wrong message could display.


💡 Medium Priority Issues

8. Code Duplication: finishUploadComplete/Incomplete (upload.js:220-252)

Both functions share similar logic. Consider consolidating.

9. Missing Input Sanitization (forms.py:20-24)

While no_control_chars_validator and no_backslash_char_validator are used, the pseudonym is directly inserted into DICOM tags without additional sanitization. Ensure validators are comprehensive.

10. Commented Test Code (test_views.py:13-14, 21-22)

# response = client.post(reverse("data_upload", kwargs={"node_id": 1}))
# assert response.status_code == 403

Recommendation: Either implement these tests or remove the comments.

11. TODO in core code (adit/core/utils/dicom_operator.py:53)

# TODO: also make retries and timeouts possible in DicomWebConnector

If this TODO is addressed in this PR, great! If not, consider if retry logic is important for upload reliability.

12. Admin Registration (admin.py:6)

admin.site.register(UploadSettings, admin.ModelAdmin)

Using the default ModelAdmin works but consider creating a custom admin class for better control.


🔍 Low Priority / Suggestions

13. Documentation

  • README.md moves upload from "Upcoming features" to current features - good!
  • Consider adding docstrings to upload_api_view and JavaScript functions
  • Document the anonymization approach and seed requirements

14. Performance: Client-Side DICOM Parsing

Reading DICOM files twice (once for validation, once for anonymization) is inefficient. Consider combining these operations.

15. JavaScript Error Messages

Error messages like "Upload Failed due to an Error" are not user-friendly. Consider more specific guidance.

16. Unused Variable (forms.py:29)

self.action = kwargs.pop("action")

The action parameter is stored but doesn't appear to be used in the form.

17. Vendor Files

Large vendor files (dcmjs.js - 23k lines) are committed. Consider:

  • Using a CDN
  • Using npm/package management
  • Adding a comment indicating version and source

📋 Testing Gaps

Consider adding tests for:

  1. Upload with invalid DICOM files
  2. Upload with mixed patient data (should fail validation)
  3. Upload to unauthorized destination node
  4. Upload with missing anonymization seed
  5. Concurrent uploads
  6. Upload of very large files
  7. Network failure scenarios

🎯 Summary & Recommendations

Must Fix Before Merge:

  1. Address anonymization seed security (Fully automate dev setup with Gitpod #1)
  2. Add proper error logging (wip #2)
  3. Fix isValidSeries edge case (Polymorphism #6)

Should Fix Before Merge:
4. Memory management for large uploads (#5)
5. Add DICOM validation (#3)

Consider for Follow-up PR:
6. Performance improvements (concurrent uploads, streaming)
7. Enhanced error messages
8. Additional test coverage


📝 Final Notes

This is a solid implementation of a complex feature! The architecture is clean, security is mostly well-handled, and the UX considerations (drag-drop, progress, cancel) are excellent. The main concerns are around the anonymization seed exposure and robustness for large uploads.

Great work overall! Please address the critical issues and consider the high-priority suggestions. Happy to review again once updated.

Reviewed following: Google Python Style Guide as per CONTRIBUTING.md

@claude
Copy link

claude bot commented Oct 30, 2025

Pull Request Review: Upload Feature (#80)

This is a comprehensive PR adding a DICOM upload feature with client-side anonymization. Overall, this is well-implemented with good test coverage. Below are my findings organized by category:


Strengths

Code Quality

  • Clean separation of concerns with dedicated modules (views, forms, models, utils)
  • Good use of Django's permission system with can_upload_data permission
  • Comprehensive acceptance tests covering multiple scenarios
  • TypeScript-style type checking in JavaScript with proper error handling
  • Follows Google Python Style Guide as per project standards

Test Coverage

  • Excellent acceptance test coverage (8 test cases) including:
    • Happy path (successful upload)
    • Error cases (missing pseudonym, missing destination)
    • Edge cases (unsupported file types, stopping upload)
    • Data validation (pseudonym verification with parametrized tests)
  • Permission-based access control tests

⚠️ Issues & Recommendations

🔴 Security Concerns (High Priority)

  1. Weak Anonymization Seed Handling (adit/upload/static/upload/upload.js:256-260)

    • Issue: seed += document.cookie uses string concatenation instead of proper cryptographic mixing
    • Risk: Predictable seed values compromise anonymization security
    • Recommendation: Use a proper HMAC or key derivation function (e.g., Web Crypto API's deriveBits)
    // Better approach:
    const encoder = new TextEncoder();
    const seedData = encoder.encode(seed + document.cookie);
    const hashBuffer = await crypto.subtle.digest('SHA-256', seedData);
    const hashArray = Array.from(new Uint8Array(hashBuffer));
    const hashHex = hashArray.map(b => b.toString(16).padStart(2, '0')).join('');
  2. Empty Anonymization Seed Default (adit/settings/base.py:458)

    • Issue: ANONYMIZATION_SEED = env.str("ANONYMIZATION_SEED", default="") allows empty seeds
    • Risk: Empty seed breaks anonymization (JS throws error, but should be caught earlier)
    • Recommendation: Fail fast in Django settings if seed is empty in production
    ANONYMIZATION_SEED = env.str("ANONYMIZATION_SEED", default="")
    if not ANONYMIZATION_SEED and not DEBUG:
        raise ImproperlyConfigured("ANONYMIZATION_SEED must be set in production")
  3. Seed Exposure in Templates (adit/upload/views.py:53)

    • Issue: Seed passed to template context could be visible in HTML source
    • Risk: If seed is exposed, anonymization is compromised
    • Recommendation: Verify the template properly protects this value (e.g., in a script tag with proper escaping)

🟡 Potential Bugs (Medium Priority)

  1. Memory Leak Risk in File Upload (adit/upload/static/upload/upload.js:143-163)

    • Issue: All DICOM files are loaded into memory sequentially with await file.arrayBuffer()
    • Risk: Large studies (hundreds of images) could cause browser memory issues
    • Recommendation: Add file size validation or implement chunked processing with periodic cleanup
    const MAX_TOTAL_SIZE = 500 * 1024 * 1024; // 500MB limit
    const totalSize = Array.from(files).reduce((sum, f) => sum + f.size, 0);
    if (totalSize > MAX_TOTAL_SIZE) {
        throw new Error(`Total size exceeds limit: ${totalSize} bytes`);
    }
  2. Incomplete Cleanup on Error (adit/upload/views.py:120-138)

    • Issue: If read_dataset fails, the uploaded file is still in memory but never explicitly closed
    • Risk: Memory leak on repeated failed uploads
    • Recommendation: Use proper context managers or explicit cleanup
    try:
        dataset_bytes = BytesIO(uploaded_file.read())
        dataset = read_dataset(dataset_bytes)
    except Exception as e:
        logger.exception(...)
        return HttpResponse(status=400, content="Invalid dataset")
    finally:
        dataset_bytes.close()  # Ensure cleanup
  3. Loose Equality Check (adit/upload/static/upload/upload.js:192)

    • Issue: if (loadedFiles == dataset_length) uses loose equality
    • Risk: Type coercion could cause unexpected behavior
    • Recommendation: Use strict equality: if (loadedFiles === dataset_length)
  4. Unused Function (adit/upload/apps.py:25-29)

    • Issue: create_app_settings() is defined but never called
    • Recommendation: Either call it in ready() or remove if not needed

🟠 Performance Considerations (Medium Priority)

  1. Synchronous Upload Loop (adit/upload/static/upload/upload.js:143-191)

    • Issue: Files are uploaded sequentially, one at a time
    • Impact: Slow for large studies (e.g., 200 files × 2 seconds = 6+ minutes)
    • Recommendation: Implement concurrent uploads with a semaphore (e.g., 5 concurrent uploads)
  2. N+1 Validation Pattern (adit/upload/static/upload/upload.js:309-346)

    • Issue: isValidSeries reads all files' DICOM headers before upload starts
    • Impact: Adds latency (read all files twice: once for validation, once for upload)
    • Recommendation: Validate during upload loop to reduce latency
  3. Blocking Async View (adit/upload/views.py:140-141)

    • Issue: await loop.run_in_executor(None, operator.upload_images, [dataset]) blocks the executor thread
    • Risk: Under high load, could exhaust thread pool
    • Recommendation: Consider using a task queue (Celery) for uploads

🔵 Code Quality & Style (Low Priority)

  1. Missing Type Hints (adit/upload/admin.py:6)

    • Issue: admin.site.register(UploadSettings, admin.ModelAdmin) could use custom ModelAdmin class
    • Recommendation: Add custom admin class with proper type hints
  2. Magic Numbers (adit/upload/static/upload/upload.js:218, 241)

    • Issue: Hardcoded 5000 millisecond timeout
    • Recommendation: Extract to named constant: const UPLOAD_COMPLETE_TIMEOUT = 5000;
  3. Inconsistent Error Messages

    • JavaScript uses generic messages like "Upload Failed" without details
    • Recommendation: Include error details from server response for better debugging
  4. Empty __init__.py Files

    • Multiple empty __init__.py files added (migrations, utils, tests)
    • This is fine but could include docstrings for module purposes

📋 Additional Observations

Positive Aspects:

  • ✅ Proper permission checks at both view and API level
  • ✅ CSRF protection properly implemented
  • ✅ Good separation between sync and async code
  • ✅ Client-side validation reduces server load
  • ✅ Progressive enhancement with progress bars and stop functionality
  • ✅ Parametrized tests demonstrate good testing practices

Documentation:

  • README updated to remove "upcoming features" (good!)
  • Consider adding:
    • JSDoc comments for complex JS functions
    • Docstrings for Python views explaining upload flow
    • Security notes about anonymization seed management

🎯 Recommended Action Items (Prioritized)

Must Fix Before Merge:

  1. ❗ Fix anonymization seed concatenation security issue
  2. ❗ Add validation to prevent empty ANONYMIZATION_SEED in production
  3. ❗ Replace loose equality with strict equality in JavaScript

Should Fix:

  1. Add memory limits for file uploads
  2. Improve error handling and cleanup in upload_api_view
  3. Either implement or remove create_app_settings()

Nice to Have:

  1. Implement concurrent uploads for better performance
  2. Add more detailed error messages to users
  3. Extract magic numbers to constants

📊 Summary

Verdict: Approve with required changes

This PR adds significant value by implementing the upload feature that users need. The code quality is generally high, tests are comprehensive, and the architecture is sound. However, the security issues around anonymization seed handling must be addressed before merging to production.

The main concerns are:

  • Security: Weak seed mixing mechanism
  • Performance: Sequential uploads could be slow for large studies
  • Robustness: Memory management needs improvement

Once the high-priority security issues are resolved, this will be a solid addition to the codebase.


Great work on this feature! 🎉 Let me know if you need clarification on any of these points.

@claude
Copy link

claude bot commented Oct 30, 2025

Pull Request Review: Upload Feature

This PR implements a DICOM upload portal with client-side anonymization. Overall, the feature is well-implemented with good test coverage.


Code Quality & Best Practices

Strengths

  • Good separation of concerns: Models, views, forms, and business logic are properly separated
  • Type hints: Good use of type hints in Python code
  • Async/await patterns: Proper use of async views and sync_to_async for Django ORM operations
  • Permission checks: Consistent use of LoginRequiredMixin and PermissionRequiredMixin
  • JavaScript structure: Well-organized Alpine.js component with clear state management

Areas for Improvement

  1. JavaScript Error Handling (upload.js:202-208) - The error is logged to console but not communicated to the backend. Consider sending error telemetry.

  2. Trailing Whitespace (models.py:11-12, widgets.py:9) - Empty lines have unnecessary trailing whitespace.

  3. Commented Code (test_views.py:13-14, 21-22) - Remove commented-out test code or implement these tests.

  4. Unused Function (apps.py:25-29) - create_app_settings() is defined but never called. Document it or remove it.

@claude
Copy link

claude bot commented Oct 30, 2025

Security Concerns

Critical Issues

  1. Anonymization Seed Exposure (views.py:53, upload.js:247-268)

    • The anonymization seed is sent to the client and used with document.cookie
    • Risk: If an attacker obtains the seed, they could potentially reverse the anonymization
    • Recommendation: Keep seed server-side only, or use per-session random seeds that aren't derived from a static configuration value
  2. Cookie Usage in Anonymization (upload.js:261)

    • Using cookies in the anonymization seed is unusual and potentially problematic
    • If session cookies change, the same patient data would anonymize to different pseudonyms
    • Recommendation: Use a more deterministic approach or document why this is necessary
  3. Client-Side Validation Only (upload.js:315-353)

    • isValidSeries validation only happens client-side
    • Server should also validate that all DICOM files in an upload belong to the same patient
    • Malicious users could bypass client validation

Medium Issues

  1. File Size Limits
    • No apparent file size limits on uploads
    • Could lead to DoS attacks or resource exhaustion
    • Recommendation: Add file size and total upload size limits

@claude
Copy link

claude bot commented Oct 30, 2025

Potential Bugs

  1. Race Condition in Upload (upload.js:172-190)

    • Multiple files uploaded sequentially without transaction guarantees
    • If upload fails mid-series, partial data might exist on the server
    • Recommendation: Add server-side cleanup for failed multi-file uploads
  2. Node ID Type Mismatch (urls.py:15, views.py:108)

    • URL parameter is str:node_id but used with DicomServer.objects.aget(id=node_id)
    • If id is an integer field, this could fail
    • Recommendation: Use int:node_id or convert to int explicitly
  3. Progress Bar Logic (upload.js:179-183)

    • The condition newPBValue > currentPBValue is unnecessary; progress should always increase
    • Simplify to just set the value directly
  4. Loose Equality (upload.js:177)

    • Use strict equality: status === 200 instead of status == 200
  5. Dataset Bytes Resource Management (views.py:128-137)

    • Good use of finally, but if an exception occurs before line 128, dataset_bytes might not be defined
    • Consider using context managers

@claude
Copy link

claude bot commented Oct 30, 2025

Performance Considerations

  1. Synchronous File Processing (upload.js:143-163)

    • Each file is read, parsed, anonymized, and uploaded sequentially
    • For large datasets (hundreds of files), this could be very slow
    • Recommendation: Consider batching uploads or parallel processing (with concurrency limits)
  2. Memory Usage (upload.js:144)

    • Loading entire files into memory could cause issues with large DICOM files
    • Modern DICOM can be 500MB+ (e.g., whole-slide imaging)
    • Recommendation: Consider streaming or chunked processing
  3. Executor Thread Pool (views.py:143)

    • Using default executor (None) might limit concurrency
    • Consider using a dedicated thread pool with appropriate sizing
  4. N+1 Validation Pattern (upload.js:320-344)

    • Each file is parsed completely during validation
    • Then parsed again during anonymization (line 146)
    • Recommendation: Cache parsed DICOM data to avoid double-parsing

@claude
Copy link

claude bot commented Oct 30, 2025

Test Coverage

Strengths

  • Excellent acceptance test coverage with Playwright
  • Tests cover happy paths and error cases
  • Good use of fixtures and parametrization
  • Tests validate the end-to-end workflow including DICOM operations

Gaps

  1. Unit Tests for Views (test_views.py)

    • Only 2 basic permission tests
    • Missing tests for: upload_api_view logic, form validation edge cases, error handling paths, HTMX-specific behavior
  2. JavaScript Unit Tests

    • No unit tests for upload.js
    • Consider adding Jest or similar for testing: isValidSeries logic, createAnonymizer function, error handling paths, state management
  3. Security Tests

    • No tests validating: Permission checks for upload_api_view, node accessibility checks, input validation bypasses
  4. Performance Tests

    • No tests for large file uploads
    • No tests for concurrent uploads

@claude
Copy link

claude bot commented Oct 30, 2025

Additional Recommendations

  1. Documentation

    • Add docstrings to complex functions (loadFiles, isValidSeries, etc.)
    • Document the anonymization strategy and why client-side is chosen
    • Add comments explaining the seed + cookie approach
  2. Configuration

    • The required ANONYMIZATION_SEED setting will break existing deployments
    • Consider adding a migration guide or default value for backward compatibility
  3. Vendor Files

    • 23k+ lines of vendor JavaScript added (dcmjs.js)
    • Consider using a CDN or package manager instead of committing vendor files
    • Add a package.json entry for version tracking
  4. Admin Interface (admin.py)

    • Very basic admin registration
    • Consider adding custom admin with readonly fields, search, filters
  5. Logging

    • Good logging in views.py but missing in forms and other components
    • Add structured logging for upload events (success, failure, duration, file count)

@claude
Copy link

claude bot commented Oct 30, 2025

Summary

Approval Recommendation: ✅ Approve with requested changes

This is a solid implementation of a complex feature. The client-side anonymization is innovative, and the test coverage for the happy path is excellent. However, there are important security concerns around seed management that should be addressed before merging to production.

Priority Fixes Before Merge:

  1. 🔴 Critical: Review and redesign the anonymization seed approach
  2. 🟡 High: Add server-side validation for DICOM series consistency
  3. 🟡 High: Fix the node_id type mismatch issue
  4. 🟡 High: Add file size limits

Can Be Addressed in Follow-up PRs:

  • Performance optimizations for large uploads
  • Additional unit test coverage
  • JavaScript test suite
  • Documentation improvements
  • Vendor file management

Great work on this feature! The structure is clean and the acceptance tests demonstrate good quality assurance practices.

@medihack medihack merged commit c9b9515 into main Oct 30, 2025
2 checks passed
@medihack medihack deleted the upload_dev branch October 30, 2025 21:15
@github-project-automation github-project-automation bot moved this from Todo to Done in ADIT Oct 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Client side pseudonymization Upload with websockets and integrated client side pseudonymization

3 participants