Upload feature by lmdulz · Pull Request #80 · openradx/adit

lmdulz · 2024-08-29T15:19:11Z

Closes #19
Closes #27

adit/upload/models.py

adit/upload/static/upload/upload.js

medihack · 2024-09-22T22:07:24Z

adit/upload/templates/upload/upload_job_form.html

+                   @change="toggleUploadButtonVisibility()"
+                   accept=".dcm,.DCM"
+                   webkitdirectory
+                   mozdirectory>


Which browsers are supported? Do we need a hint somewhere what the browser support is?

adit/upload/static/upload/upload.js

lmdulz · 2024-09-23T10:04:35Z

adit/upload/static/upload/upload.js

+        seed
+      );
+
+      return anon;


analog zu genereller ADIT anonymisierung nach anonymisierung PatID und PatNamen mit Pseudonym ersetzen

adit/upload/static/upload/upload.js

…om validation

tasks.py

…b-anonymizer

medihack · 2025-10-21T09:10:38Z

@codex

chatgpt-codex-connector · 2025-10-21T09:16:51Z

Summary

Hardened the upload API to return a 400 response when the dataset payload is missing, preventing crashes on malformed requests. adit/upload/views.pyL103-L137
Refactored the upload form’s Alpine component for consistent method style, stronger validation (including anonymizer seed checks), and clearer upload handling logic. adit/upload/static/upload/upload.jsL3-L337
Documented browser support expectations for directory uploads directly in the user-facing form. adit/upload/templates/upload/upload_job_form.htmlL12-L25

Testing

⚠️ Not run (not requested)

View task →

claude · 2025-10-29T14:26:09Z

Code Review - Upload Feature (PR #80)

Thank you for this substantial contribution implementing the upload feature! This is a significant addition with ~26k lines added. Here's my detailed review:

✅ Strengths

Security Implementation
- Proper permission checks with PermissionRequiredMixin and can_upload_data permission
- CSRF token validation in API requests
- User authentication requirements enforced
- Client-side anonymization before upload (good privacy practice)
- Permission verification for destination node access
Code Quality
- Follows Google Python Style Guide as per CONTRIBUTING.md
- Good separation of concerns (models, views, forms, URLs)
- Type hints used appropriately
- Clean form validation with custom validators
Testing
- Includes both unit tests and acceptance tests
- Permission tests verify security boundaries
- Tests cover key workflows (clear files, stop upload)
User Experience
- Drag-and-drop support for files
- Progress tracking during upload
- Upload cancellation capability
- HTMX integration for smooth UX

🔴 Critical Issues

1. Security: Anonymization Seed Exposure (adit/upload/views.py:50)

context["anon_seed"] = settings.ANONYMIZATION_SEED

Problem: The anonymization seed is passed directly to the client-side JavaScript. If this seed is meant to be secret for consistent anonymization across the system, exposing it to clients is a security vulnerability. Any user could extract this seed and potentially re-identify anonymized data.

Recommendation:

If the seed must be shared, document WHY it's acceptable
Consider per-user or per-session seeds instead
Alternatively, perform anonymization server-side only

2. Error Handling: Silent Failures (adit/upload/views.py:135-137)

except Exception:
    status = 500
    message = "Upload failed"

Problem: Bare except Exception catches ALL exceptions without logging. This makes debugging production issues nearly impossible.

Recommendation:

except Exception as e:
    logger.exception("Upload failed for user %s to node %s", request.user.id, node_id)
    status = 500
    message = "Upload failed"

3. Data Validation: Missing DICOM Validation (adit/upload/views.py:124)

The uploaded dataset is read without validation:

dataset = read_dataset(dataset_bytes)

Problem: No validation that the uploaded file is actually valid DICOM data. Malicious or corrupted files could cause issues.

Recommendation: Add try-except around DICOM parsing and validate structure.

⚠️ High Priority Issues

4. JavaScript: Sequential Uploads Block UI (upload.js:152-198)

for (const set of datasets) {
  // ... anonymize and upload ...
  status = await uploadData({...});
}

Problem: Uploads happen sequentially in a for-loop, blocking the browser for potentially long periods. For large studies (100+ images), this creates poor UX.

Recommendation: Implement batch uploads with concurrency:

const uploadPromises = datasets.map(async (set) => {
  // anonymize and upload
});
await Promise.allSettled(uploadPromises.slice(0, 5)); // 5 concurrent uploads

5. Memory Management: All Files Loaded Into Memory (upload.js:127-130)

for (const fileEntry of files) {
  await this.fileHandler(fileEntry, datasets);
}

Problem: ALL files are loaded into memory as ArrayBuffers before processing begins. For a 500-file study, this could exceed browser memory limits.

Recommendation: Stream-process files instead of loading all at once.

6. Data Integrity: isValidSeries Logic Issue (upload.js:316-351)

return (
  patientIDs.size == 1 &&
  patientNames.size == 1 &&
  patientBirthdates.size == 1
);

Problem: This validation ONLY checks if all files have the SAME patient info. It doesn't verify:

If any required DICOM tags are present
If the study belongs to the patient
Edge case: If ALL files are missing patient info, the sets would be empty (size 0) and validation would pass!

Recommendation: Add explicit checks:

// Ensure at least one valid patient was found
if (patientIDs.size === 0) {
  return false;
}
return patientIDs.size === 1 && patientNames.size === 1 && patientBirthdates.size === 1;

7. Race Condition: uploadResultText (upload.js:193, 206, 211)

Multiple code paths set uploadResultText without proper state management. If network responses arrive out of order, the wrong message could display.

💡 Medium Priority Issues

8. Code Duplication: finishUploadComplete/Incomplete (upload.js:220-252)

Both functions share similar logic. Consider consolidating.

9. Missing Input Sanitization (forms.py:20-24)

While no_control_chars_validator and no_backslash_char_validator are used, the pseudonym is directly inserted into DICOM tags without additional sanitization. Ensure validators are comprehensive.

10. Commented Test Code (test_views.py:13-14, 21-22)

# response = client.post(reverse("data_upload", kwargs={"node_id": 1}))
# assert response.status_code == 403

Recommendation: Either implement these tests or remove the comments.

11. TODO in core code (adit/core/utils/dicom_operator.py:53)

# TODO: also make retries and timeouts possible in DicomWebConnector

If this TODO is addressed in this PR, great! If not, consider if retry logic is important for upload reliability.

12. Admin Registration (admin.py:6)

admin.site.register(UploadSettings, admin.ModelAdmin)

Using the default ModelAdmin works but consider creating a custom admin class for better control.

🔍 Low Priority / Suggestions

13. Documentation

README.md moves upload from "Upcoming features" to current features - good!
Consider adding docstrings to upload_api_view and JavaScript functions
Document the anonymization approach and seed requirements

14. Performance: Client-Side DICOM Parsing

Reading DICOM files twice (once for validation, once for anonymization) is inefficient. Consider combining these operations.

15. JavaScript Error Messages

Error messages like "Upload Failed due to an Error" are not user-friendly. Consider more specific guidance.

16. Unused Variable (forms.py:29)

self.action = kwargs.pop("action")

The action parameter is stored but doesn't appear to be used in the form.

17. Vendor Files

Large vendor files (dcmjs.js - 23k lines) are committed. Consider:

Using a CDN
Using npm/package management
Adding a comment indicating version and source

📋 Testing Gaps

Consider adding tests for:

Upload with invalid DICOM files
Upload with mixed patient data (should fail validation)
Upload to unauthorized destination node
Upload with missing anonymization seed
Concurrent uploads
Upload of very large files
Network failure scenarios

🎯 Summary & Recommendations

Must Fix Before Merge:

Address anonymization seed security (Fully automate dev setup with Gitpod #1)
Add proper error logging (wip #2)
Fix isValidSeries edge case (Polymorphism #6)

Should Fix Before Merge:
4. Memory management for large uploads (#5)
5. Add DICOM validation (#3)

Consider for Follow-up PR:
6. Performance improvements (concurrent uploads, streaming)
7. Enhanced error messages
8. Additional test coverage

📝 Final Notes

This is a solid implementation of a complex feature! The architecture is clean, security is mostly well-handled, and the UX considerations (drag-drop, progress, cancel) are excellent. The main concerns are around the anonymization seed exposure and robustness for large uploads.

Great work overall! Please address the critical issues and consider the high-priority suggestions. Happy to review again once updated.

Reviewed following: Google Python Style Guide as per CONTRIBUTING.md

claude · 2025-10-30T11:19:22Z

Pull Request Review: Upload Feature (#80)

This is a comprehensive PR adding a DICOM upload feature with client-side anonymization. Overall, this is well-implemented with good test coverage. Below are my findings organized by category:

✅ Strengths

Code Quality

Clean separation of concerns with dedicated modules (views, forms, models, utils)
Good use of Django's permission system with can_upload_data permission
Comprehensive acceptance tests covering multiple scenarios
TypeScript-style type checking in JavaScript with proper error handling
Follows Google Python Style Guide as per project standards

Test Coverage

Excellent acceptance test coverage (8 test cases) including:
- Happy path (successful upload)
- Error cases (missing pseudonym, missing destination)
- Edge cases (unsupported file types, stopping upload)
- Data validation (pseudonym verification with parametrized tests)
Permission-based access control tests

⚠️ Issues & Recommendations

🔴 Security Concerns (High Priority)

Weak Anonymization Seed Handling (adit/upload/static/upload/upload.js:256-260)

Issue: seed += document.cookie uses string concatenation instead of proper cryptographic mixing
Risk: Predictable seed values compromise anonymization security
Recommendation: Use a proper HMAC or key derivation function (e.g., Web Crypto API's deriveBits)

// Better approach:
const encoder = new TextEncoder();
const seedData = encoder.encode(seed + document.cookie);
const hashBuffer = await crypto.subtle.digest('SHA-256', seedData);
const hashArray = Array.from(new Uint8Array(hashBuffer));
const hashHex = hashArray.map(b => b.toString(16).padStart(2, '0')).join('');

Empty Anonymization Seed Default (adit/settings/base.py:458)
- Issue: ANONYMIZATION_SEED = env.str("ANONYMIZATION_SEED", default="") allows empty seeds
- Risk: Empty seed breaks anonymization (JS throws error, but should be caught earlier)
- Recommendation: Fail fast in Django settings if seed is empty in production
```
ANONYMIZATION_SEED = env.str("ANONYMIZATION_SEED", default="")
if not ANONYMIZATION_SEED and not DEBUG:
    raise ImproperlyConfigured("ANONYMIZATION_SEED must be set in production")
```
Seed Exposure in Templates (adit/upload/views.py:53)
- Issue: Seed passed to template context could be visible in HTML source
- Risk: If seed is exposed, anonymization is compromised
- Recommendation: Verify the template properly protects this value (e.g., in a script tag with proper escaping)

🟡 Potential Bugs (Medium Priority)

Memory Leak Risk in File Upload (adit/upload/static/upload/upload.js:143-163)
- Issue: All DICOM files are loaded into memory sequentially with await file.arrayBuffer()
- Risk: Large studies (hundreds of images) could cause browser memory issues
- Recommendation: Add file size validation or implement chunked processing with periodic cleanup
```
const MAX_TOTAL_SIZE = 500 * 1024 * 1024; // 500MB limit
const totalSize = Array.from(files).reduce((sum, f) => sum + f.size, 0);
if (totalSize > MAX_TOTAL_SIZE) {
    throw new Error(`Total size exceeds limit: ${totalSize} bytes`);
}
```

Incomplete Cleanup on Error (adit/upload/views.py:120-138)

Issue: If read_dataset fails, the uploaded file is still in memory but never explicitly closed
Risk: Memory leak on repeated failed uploads
Recommendation: Use proper context managers or explicit cleanup

try:
    dataset_bytes = BytesIO(uploaded_file.read())
    dataset = read_dataset(dataset_bytes)
except Exception as e:
    logger.exception(...)
    return HttpResponse(status=400, content="Invalid dataset")
finally:
    dataset_bytes.close()  # Ensure cleanup

Loose Equality Check (adit/upload/static/upload/upload.js:192)
- Issue: if (loadedFiles == dataset_length) uses loose equality
- Risk: Type coercion could cause unexpected behavior
- Recommendation: Use strict equality: if (loadedFiles === dataset_length)
Unused Function (adit/upload/apps.py:25-29)
- Issue: create_app_settings() is defined but never called
- Recommendation: Either call it in ready() or remove if not needed

🟠 Performance Considerations (Medium Priority)

Synchronous Upload Loop (adit/upload/static/upload/upload.js:143-191)
- Issue: Files are uploaded sequentially, one at a time
- Impact: Slow for large studies (e.g., 200 files × 2 seconds = 6+ minutes)
- Recommendation: Implement concurrent uploads with a semaphore (e.g., 5 concurrent uploads)
N+1 Validation Pattern (adit/upload/static/upload/upload.js:309-346)
- Issue: isValidSeries reads all files' DICOM headers before upload starts
- Impact: Adds latency (read all files twice: once for validation, once for upload)
- Recommendation: Validate during upload loop to reduce latency
Blocking Async View (adit/upload/views.py:140-141)
- Issue: await loop.run_in_executor(None, operator.upload_images, [dataset]) blocks the executor thread
- Risk: Under high load, could exhaust thread pool
- Recommendation: Consider using a task queue (Celery) for uploads

🔵 Code Quality & Style (Low Priority)

Missing Type Hints (adit/upload/admin.py:6)
- Issue: admin.site.register(UploadSettings, admin.ModelAdmin) could use custom ModelAdmin class
- Recommendation: Add custom admin class with proper type hints
Magic Numbers (adit/upload/static/upload/upload.js:218, 241)
- Issue: Hardcoded 5000 millisecond timeout
- Recommendation: Extract to named constant: const UPLOAD_COMPLETE_TIMEOUT = 5000;
Inconsistent Error Messages
- JavaScript uses generic messages like "Upload Failed" without details
- Recommendation: Include error details from server response for better debugging
Empty __init__.py Files
- Multiple empty __init__.py files added (migrations, utils, tests)
- This is fine but could include docstrings for module purposes

📋 Additional Observations

Positive Aspects:

✅ Proper permission checks at both view and API level
✅ CSRF protection properly implemented
✅ Good separation between sync and async code
✅ Client-side validation reduces server load
✅ Progressive enhancement with progress bars and stop functionality
✅ Parametrized tests demonstrate good testing practices

Documentation:

README updated to remove "upcoming features" (good!)
Consider adding:
- JSDoc comments for complex JS functions
- Docstrings for Python views explaining upload flow
- Security notes about anonymization seed management

🎯 Recommended Action Items (Prioritized)

Must Fix Before Merge:

❗ Fix anonymization seed concatenation security issue
❗ Add validation to prevent empty ANONYMIZATION_SEED in production
❗ Replace loose equality with strict equality in JavaScript

Should Fix:

Add memory limits for file uploads
Improve error handling and cleanup in upload_api_view
Either implement or remove create_app_settings()

Nice to Have:

Implement concurrent uploads for better performance
Add more detailed error messages to users
Extract magic numbers to constants

📊 Summary

Verdict: Approve with required changes

This PR adds significant value by implementing the upload feature that users need. The code quality is generally high, tests are comprehensive, and the architecture is sound. However, the security issues around anonymization seed handling must be addressed before merging to production.

The main concerns are:

Security: Weak seed mixing mechanism
Performance: Sequential uploads could be slow for large studies
Robustness: Memory management needs improvement

Once the high-priority security issues are resolved, this will be a solid addition to the codebase.

Great work on this feature! 🎉 Let me know if you need clarification on any of these points.

claude · 2025-10-30T11:41:33Z

Pull Request Review: Upload Feature

This PR implements a DICOM upload portal with client-side anonymization. Overall, the feature is well-implemented with good test coverage.

Code Quality & Best Practices

Strengths

Good separation of concerns: Models, views, forms, and business logic are properly separated
Type hints: Good use of type hints in Python code
Async/await patterns: Proper use of async views and sync_to_async for Django ORM operations
Permission checks: Consistent use of LoginRequiredMixin and PermissionRequiredMixin
JavaScript structure: Well-organized Alpine.js component with clear state management

Areas for Improvement

JavaScript Error Handling (upload.js:202-208) - The error is logged to console but not communicated to the backend. Consider sending error telemetry.
Trailing Whitespace (models.py:11-12, widgets.py:9) - Empty lines have unnecessary trailing whitespace.
Commented Code (test_views.py:13-14, 21-22) - Remove commented-out test code or implement these tests.
Unused Function (apps.py:25-29) - create_app_settings() is defined but never called. Document it or remove it.

claude · 2025-10-30T11:41:41Z

Security Concerns

Critical Issues

Anonymization Seed Exposure (views.py:53, upload.js:247-268)
- The anonymization seed is sent to the client and used with document.cookie
- Risk: If an attacker obtains the seed, they could potentially reverse the anonymization
- Recommendation: Keep seed server-side only, or use per-session random seeds that aren't derived from a static configuration value
Cookie Usage in Anonymization (upload.js:261)
- Using cookies in the anonymization seed is unusual and potentially problematic
- If session cookies change, the same patient data would anonymize to different pseudonyms
- Recommendation: Use a more deterministic approach or document why this is necessary
Client-Side Validation Only (upload.js:315-353)
- isValidSeries validation only happens client-side
- Server should also validate that all DICOM files in an upload belong to the same patient
- Malicious users could bypass client validation

Medium Issues

File Size Limits
- No apparent file size limits on uploads
- Could lead to DoS attacks or resource exhaustion
- Recommendation: Add file size and total upload size limits

claude · 2025-10-30T11:41:49Z

Potential Bugs

Race Condition in Upload (upload.js:172-190)
- Multiple files uploaded sequentially without transaction guarantees
- If upload fails mid-series, partial data might exist on the server
- Recommendation: Add server-side cleanup for failed multi-file uploads
Node ID Type Mismatch (urls.py:15, views.py:108)
- URL parameter is str:node_id but used with DicomServer.objects.aget(id=node_id)
- If id is an integer field, this could fail
- Recommendation: Use int:node_id or convert to int explicitly
Progress Bar Logic (upload.js:179-183)
- The condition newPBValue > currentPBValue is unnecessary; progress should always increase
- Simplify to just set the value directly
Loose Equality (upload.js:177)
- Use strict equality: status === 200 instead of status == 200
Dataset Bytes Resource Management (views.py:128-137)
- Good use of finally, but if an exception occurs before line 128, dataset_bytes might not be defined
- Consider using context managers

claude · 2025-10-30T11:41:58Z

Performance Considerations

Synchronous File Processing (upload.js:143-163)
- Each file is read, parsed, anonymized, and uploaded sequentially
- For large datasets (hundreds of files), this could be very slow
- Recommendation: Consider batching uploads or parallel processing (with concurrency limits)
Memory Usage (upload.js:144)
- Loading entire files into memory could cause issues with large DICOM files
- Modern DICOM can be 500MB+ (e.g., whole-slide imaging)
- Recommendation: Consider streaming or chunked processing
Executor Thread Pool (views.py:143)
- Using default executor (None) might limit concurrency
- Consider using a dedicated thread pool with appropriate sizing
N+1 Validation Pattern (upload.js:320-344)
- Each file is parsed completely during validation
- Then parsed again during anonymization (line 146)
- Recommendation: Cache parsed DICOM data to avoid double-parsing

claude · 2025-10-30T11:42:06Z

Test Coverage

Strengths

Excellent acceptance test coverage with Playwright
Tests cover happy paths and error cases
Good use of fixtures and parametrization
Tests validate the end-to-end workflow including DICOM operations

Gaps

Unit Tests for Views (test_views.py)
- Only 2 basic permission tests
- Missing tests for: upload_api_view logic, form validation edge cases, error handling paths, HTMX-specific behavior
JavaScript Unit Tests
- No unit tests for upload.js
- Consider adding Jest or similar for testing: isValidSeries logic, createAnonymizer function, error handling paths, state management
Security Tests
- No tests validating: Permission checks for upload_api_view, node accessibility checks, input validation bypasses
Performance Tests
- No tests for large file uploads
- No tests for concurrent uploads

claude · 2025-10-30T11:42:13Z

Additional Recommendations

Documentation
- Add docstrings to complex functions (loadFiles, isValidSeries, etc.)
- Document the anonymization strategy and why client-side is chosen
- Add comments explaining the seed + cookie approach
Configuration
- The required ANONYMIZATION_SEED setting will break existing deployments
- Consider adding a migration guide or default value for backward compatibility
Vendor Files
- 23k+ lines of vendor JavaScript added (dcmjs.js)
- Consider using a CDN or package manager instead of committing vendor files
- Add a package.json entry for version tracking
Admin Interface (admin.py)
- Very basic admin registration
- Consider adding custom admin with readonly fields, search, filters
Logging
- Good logging in views.py but missing in forms and other components
- Add structured logging for upload events (success, failure, duration, file count)

claude · 2025-10-30T11:42:19Z

Summary

Approval Recommendation: ✅ Approve with requested changes

This is a solid implementation of a complex feature. The client-side anonymization is innovative, and the test coverage for the happy path is excellent. However, there are important security concerns around seed management that should be addressed before merging to production.

Priority Fixes Before Merge:

🔴 Critical: Review and redesign the anonymization seed approach
🟡 High: Add server-side validation for DICOM series consistency
🟡 High: Fix the node_id type mismatch issue
🟡 High: Add file size limits

Can Be Addressed in Follow-up PRs:

Performance optimizations for large uploads
Additional unit test coverage
JavaScript test suite
Documentation improvements
Vendor file management

Great work on this feature! The structure is clean and the acceptance tests demonstrate good quality assurance practices.

lmdulz added 9 commits August 28, 2024 15:01

copy upload files into branch

ccb2f87

deleted a comment

ffd15d1

delete package-lock.json and package.json files

09b86eb

parse seed env variable for Anonymizer to Javascript

e27f800

cleanup for CI

c257f0c

Merge remote-tracking branch 'origin/main' into upload_dev

ef10a09

Use model pk instead of id and adapt to upgraded packages

d884d2a

adapt tests to package updates

34b4174

adjust upload tests for CI

eddce09

medihack reviewed Sep 22, 2024

View reviewed changes