DICOM File Support -- Phase 1
Status: Proposed | Author: SD | Created: 2026-04-10 | GitHub Issue: #147 | Branch: feat/dicom-support
1. Overview and Motivation
What
Accept DICOM (.dcm/.dicom) medical imaging files in the Curaway upload flow, extract structured metadata (body part, modality, study date, laterality, patient demographics), parse DICOM-SR (Structured Reports) for radiology findings, de-identify patient PII from headers before provider forwarding, auto-match imaging metadata against procedure requirements, and render a thumbnail preview in the EHR drawer.
Why
- Patients have DICOM files. Radiology departments frequently provide imaging on CD/USB as DICOM, not PDF. Patients currently cannot upload these -- they must convert to PDF first, losing structured metadata.
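- Structured metadata > OCR. DICOM headers contain machine-readable body part, modality, study date, and laterality. When these tags are populated they are read directly, whereas the OCR + LLM pipeline would have to infer the same facts from free text with lower confidence. Some tags (BodyPartExamined in particular) are optional and can be missing or site-specific, so the parser must tolerate gaps.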
- DICOM-SR carries radiology findings. Structured Reports embedded in DICOM files contain the radiologist's findings, measurements, and impressions in a parseable format -- no OCR needed.
- Procedure requirement matching. Auto-matching BodyPartExamined=KNEE + Modality=MR against a TKR procedure's "Knee MRI" requirement is deterministic and instant.
- De-identification is mandatory. DICOM headers contain patient name, DOB, and medical record numbers. These must be stripped before forwarding to international providers (GDPR + HIPAA Safe Harbor).
- Cross-border value. Patients traveling from US/UK/UAE to India/Turkey/Thailand often carry imaging on disc. Supporting DICOM directly removes a friction point in the onboarding flow.
What This Is Not
- Not a full PACS viewer. No windowing, level adjustment, or multi-frame cine playback.
- Not multi-file DICOM series handling (Phase 2).
- Not DICOM networking (C-STORE, C-FIND, WADO-RS) -- that is post-Series A.
- Not DICOM-RT (radiation therapy) support.
2. Architecture Decisions
Where dicom_parser.py Fits
New service at app/services/dicom_parser.py -- sits alongside document_processing.py in the services layer. It is a pure utility module: takes bytes in, returns structured dict out. No DB access, no side effects.
Upload flow (existing)
|
v
document_service.confirm_upload()
|
+-- file_type == "dicom" ?
| |
| YES --> dicom_parser.parse_dicom(file_bytes)
| | returns: {metadata, sr_findings, thumbnail_bytes, deidentified_bytes}
| |
| +-- Store deidentified_bytes to R2 (overwrite original)
| +-- Store thumbnail to R2 ({storage_key}_thumb.png)
| +-- Save metadata + sr_findings to doc.extracted_data
| +-- Skip OCR, go straight to run_post_ocr_pipeline() with extracted text from SR
| |
| NO --> existing OCR pipeline (unchanged)
|
v
run_post_ocr_pipeline() <-- receives SR text or metadata-derived text
|
v
Clinical Context Agent --> FHIR resources --> requirement matching --> EHR rebuild
Key Design Choices
| Decision | Choice | Rationale |
|---|---|---|
| DICOM library | pydicom 2.4+ | Widely used, pure Python, no C dependencies for parsing, active maintenance |
| Thumbnail generation | pydicom + Pillow | pydicom decodes pixel data (its pixel_array accessor requires NumPy), Pillow converts to PNG. No GDAL/VTK heavyweight dependencies. |
| De-identification | Custom tag stripper (Safe Harbor) | Full DICOM anonymization suites (deid, DicomAnonymizer) are overkill. We strip a known tag list -- deterministic, auditable, <50 lines. |
| SR parsing | pydicom ContentSequence traversal | No need for highdicom at MVP. Walk the SR tree, extract TEXT/NUM/CODE value types. |
| Storage | De-identified bytes overwrite original in R2 | The PII-bearing original exists in R2 only between the presigned upload and confirm_upload(); it is then overwritten, so no identified DICOM is retained on our infrastructure. |
| Thumbnail format | PNG, 256x256 max, 8-bit grayscale | Small enough for inline preview, good enough for "is this the right scan?" |
| Feature flag | dicom_support_enabled (Flagsmith) | Kill switch for the entire feature. Default off until tested. |
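The thumbnail row above (256x256 max, 8-bit grayscale) implies two small calculations: fitting the frame inside the size cap and mapping stored values onto 0..255. The sketch below shows one way to do both with the standard library only. The names fit_within and to_8bit are hypothetical, and the min/max scaling is an assumption; the real generate_thumbnail() may prefer the DICOM window center/width when present.

```python
from typing import Sequence

MAX_THUMB_SIDE = 256  # cap taken from the design table


def fit_within(rows: int, columns: int, max_side: int = MAX_THUMB_SIDE) -> tuple[int, int]:
    """Return (height, width) scaled down to fit max_side, preserving aspect ratio."""
    if rows <= 0 or columns <= 0:
        raise ValueError("rows and columns must be positive")
    # Never upscale: small images keep their native size.
    scale = min(1.0, max_side / max(rows, columns))
    return max(1, round(rows * scale)), max(1, round(columns * scale))


def to_8bit(pixels: Sequence[int]) -> list[int]:
    """Linearly map raw stored values (e.g. 16-bit) onto 0..255 using min/max."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        # Flat image: avoid division by zero, render as black.
        return [0] * len(pixels)
    return [round((p - lo) * 255 / (hi - lo)) for p in pixels]
```

A 512x512 frame would be reduced to 256x256, while the 64x64 test fixture would be left at its native size.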
3. Implementation Checklist
Tier 1: Opus (Architecture, Clinical Logic, Security)
- [ ] O1: app/services/dicom_parser.py -- core parser module [NEW FILE]
  - parse_dicom(file_bytes: bytes) -> DicomParseResult -- main entry point
  - extract_metadata(ds: pydicom.Dataset) -> dict -- pull structured tags
  - extract_sr_findings(ds: pydicom.Dataset) -> list[SRFinding] -- walk SR content tree
  - deidentify(ds: pydicom.Dataset) -> pydicom.Dataset -- strip PII tags per Safe Harbor
  - generate_thumbnail(ds: pydicom.Dataset) -> bytes | None -- pixel data to PNG
  - build_text_representation(metadata: dict, findings: list) -> str -- synthesize text for Clinical Context Agent
- [ ] O2: De-identification tag list and logic [IN dicom_parser.py]
  - Implement Safe Harbor tag stripping (see Section 6)
  - Replace stripped values with safe placeholders (e.g., "DEIDENTIFIED")
  - Preserve all clinical tags (BodyPartExamined, Modality, StudyDate, etc.)
  - Write to new Dataset, never modify-in-place on the input
  - Log deidentification event to audit table (tag count stripped, no PII values)
- [ ] O3: DICOM-SR parsing logic [IN dicom_parser.py]
  - Walk ContentSequence recursively
  - Extract TEXT, NUM, CODE, PNAME, DATE, TIME, UIDREF, COMPOSITE value types
  - Map SR concept names to clinical meaning (see Section 7)
  - Return structured findings list
- [ ] O4: Wire into document_processing.py [MODIFY]
  - Add DICOM branch in the processing pipeline
  - Skip OCR for DICOM files
  - Pass SR text + metadata-derived text to run_post_ocr_pipeline()
  - Set ocr_method = "dicom_metadata" for DICOM files
- [ ] O5: Wire into attachment_handler.py [MODIFY]
  - Add DICOM detection in process_attachments()
  - When file_type == "dicom", skip inline OCR, use extracted_data directly
  - Set report_type = "imaging" for Clinical Context Agent
- [ ] O6: Auto-match body part + modality against procedure requirements [MODIFY requirement_matcher.py]
  - New function: match_dicom_metadata_to_requirements(metadata, proc_reqs) -> list[dict]
  - Match BodyPartExamined + Modality against requirement descriptions
  - Higher confidence than LLM matching (deterministic, 0.95+ confidence)
- [ ] O7: FHIR ImagingStudy resource generation [MODIFY clinical_context.py prompts]
  - When DICOM metadata is available, generate FHIR ImagingStudy (not just Condition/Observation)
  - Map modality, body part, study date, accession number to ImagingStudy fields
  - Store via fhir_service.create_fhir_resource()
Tier 2: Sonnet (Mechanical Implementation, Config, Tests)
- [ ] S1: Update config/guardrails.yaml [MODIFY]
  - Add .dcm, .dicom to frontend.allowed_extensions
  - Add application/dicom to backend.allowed_mime_types (already present, verify)
  - Add DICOM-specific medical keywords: dicom, modality, body part, series
- [ ] S2: Update app/services/file_validator.py [MODIFY]
  - Add .dcm, .dicom to extension validation
  - Add application/dicom MIME type mapping
  - DICOM files skip medical keyword check (metadata provides clinical context natively)
- [ ] S3: Update presign endpoint [MODIFY app/routers/documents.py]
  - Accept .dcm/.dicom extensions in presign request validation
  - Set file_type = "dicom" in DocumentReference creation
- [ ] S4: Update confirm_upload flow [MODIFY app/services/document_service.py]
  - Detect DICOM file type
  - Call dicom_parser.parse_dicom() instead of queuing OCR
  - Store deidentified bytes back to R2
  - Store thumbnail to R2
  - Populate extracted_data with DICOM metadata
  - Set document_category = "imaging" automatically
  - Queue run_post_ocr_pipeline() with SR text
- [ ] S5: Update QStash OCR callback [MODIFY app/routers/internal.py]
  - In process_ocr(), detect DICOM file type before starting OCR tiers
  - If DICOM: call dicom_parser.parse_dicom(), skip all OCR tiers
  - Feed results into run_post_ocr_pipeline() as normal
- [ ] S6: Pydantic schemas for DICOM data [NEW FILE app/schemas/dicom.py]
  - DicomMetadata -- body_part, modality, study_date, laterality, etc.
  - SRFinding -- concept_name, value_type, value, unit, finding_type
  - DicomParseResult -- metadata, sr_findings, thumbnail_key, text_representation, deidentification_summary
- [ ] S7: Frontend -- accept DICOM file types [MODIFY ConversationApp.tsx]
  - Update accept= attribute: add .dcm,.dicom
  - Update frontend file validation (extension list)
  - Add DICOM icon/badge for file attachment cards
- [ ] S8: Frontend -- DICOM thumbnail preview [MODIFY EHR drawer components]
  - When document has thumbnail_key in extracted_data, fetch and display
  - Fallback to generic imaging icon when no thumbnail available
  - 256x256 max, grayscale, with body part + modality label overlay
- [ ] S9: Flagsmith feature flag [CONFIG]
  - Create dicom_support_enabled flag in Flagsmith
  - Gate all DICOM-specific code paths behind this flag
  - Default: OFF until integration testing passes
- [ ] S10: Add pydicom and Pillow to requirements [MODIFY requirements.txt]
  - pydicom>=2.4.0,<3.0
  - Pillow>=10.0.0,<11.0 (likely already present for other image handling)
4. File-by-File Changes
New Files
| File | Purpose |
|---|---|
| app/services/dicom_parser.py | Core DICOM parsing, de-identification, thumbnail generation, SR extraction |
| app/schemas/dicom.py | Pydantic models for DICOM metadata and SR findings |
| tests/test_dicom_parser.py | Unit tests for parser, de-identification, SR parsing |
| tests/test_dicom_pipeline.py | Integration tests for DICOM flow through document pipeline |
| tests/fixtures/sample.dcm | Minimal valid DICOM file for testing (synthetic, no real PHI) |
| tests/fixtures/sample_sr.dcm | Minimal DICOM-SR file for SR parsing tests |
tests/fixtures/sample_sr.dcm |
Minimal DICOM-SR file for SR parsing tests |
Modified Files
| File | Change |
|---|---|
| config/guardrails.yaml | Add .dcm, .dicom to frontend allowed_extensions |
| app/services/document_processing.py | DICOM branch before OCR, ocr_method="dicom_metadata" |
| app/agents/attachment_handler.py | DICOM detection, skip inline OCR, set report_type="imaging" |
| app/routers/internal.py | DICOM detection in process_ocr(), skip OCR tiers |
| app/services/document_service.py | DICOM handling in confirm_upload() |
| app/routers/documents.py | Accept .dcm/.dicom in presign |
| app/services/file_validator.py | DICOM extension + MIME validation |
| app/services/requirement_matcher.py | Deterministic DICOM metadata matching |
| app/agents/clinical_context.py | FHIR ImagingStudy generation from DICOM metadata |
| requirements.txt | Add pydicom>=2.4.0 |
| curaway-health-navigator/src/pages/ConversationApp.tsx | Accept .dcm,.dicom in file input |
5. Data Model
extracted_data JSONB Schema for DICOM Documents
When a DICOM file is processed, document_references.extracted_data will contain:
{
  "source_type": "dicom",
  "dicom_metadata": {
    "modality": "MR",
    "modality_description": "Magnetic Resonance",
    "body_part_examined": "KNEE",
    "laterality": "L",
    "study_date": "2026-03-15",
    "study_description": "MRI LEFT KNEE WITHOUT CONTRAST",
    "series_description": "SAG PD FAT SAT",
    "institution_name": "DEIDENTIFIED",
    "referring_physician": "DEIDENTIFIED",
    "accession_number": "DEIDENTIFIED",
    "manufacturer": "SIEMENS",
    "station_name": "MRC35273",
    "slice_thickness": 3.0,
    "pixel_spacing": [0.35, 0.35],
    "rows": 512,
    "columns": 512,
    "bits_allocated": 16,
    "photometric_interpretation": "MONOCHROME2",
    "number_of_frames": 1
  },
  "sr_findings": [
    {
      "concept_name": "Finding",
      "value_type": "TEXT",
      "value": "Moderate tricompartmental osteoarthritis with near-complete loss of medial compartment cartilage",
      "finding_type": "impression"
    },
    {
      "concept_name": "Measurement",
      "value_type": "NUM",
      "value": 2.3,
      "unit": "mm",
      "finding_type": "measurement",
      "measurement_site": "Medial meniscus tear"
    }
  ],
  "thumbnail_key": "tenant-001/patient-abc/doc-xyz_thumb.png",
  "deidentification_summary": {
    "tags_stripped": 18,
    "tags_preserved": 45,
    "method": "safe_harbor_v1"
  },
  "analysis_status": "completed",
  "extracted_entities": [],
  "observations": [],
  "matched_requirements": []
}
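Note that the schema stores study_date in ISO form ("2026-03-15"), while the DICOM DA value representation is YYYYMMDD ("20260315", as in the test fixture). A minimal sketch of that normalization follows; the helper name dicom_da_to_iso is hypothetical, and returning None for unparseable values is an assumption in line with the "graceful None" behavior described in the testing plan.

```python
from datetime import datetime
from typing import Optional


def dicom_da_to_iso(value: Optional[str]) -> Optional[str]:
    """Convert a DICOM DA string (YYYYMMDD) to ISO 8601, or None if absent/invalid."""
    if not value:
        return None
    # Tolerate the legacy YYYY.MM.DD spelling by dropping the dots.
    text = value.strip().replace(".", "")
    try:
        return datetime.strptime(text, "%Y%m%d").date().isoformat()
    except ValueError:
        # Malformed dates degrade to None rather than failing the whole parse.
        return None
```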
FHIR Mapping
| DICOM Tag | FHIR Resource | FHIR Field |
|---|---|---|
| Modality (0008,0060) | ImagingStudy | series[0].modality.code |
| BodyPartExamined (0018,0015) | ImagingStudy | series[0].bodySite.display |
| StudyDate (0008,0020) | ImagingStudy | started |
| StudyDescription (0008,1030) | ImagingStudy | description |
| AccessionNumber (0008,0050) | ImagingStudy | identifier[0].value (deidentified) |
| Laterality (0020,0060) | ImagingStudy | series[0].laterality.code |
| NumberOfFrames (0028,0008) | ImagingStudy | numberOfInstances (approximation: FHIR counts SOP instances, not frames; a single-file upload is one instance) |
| SR Finding (impression) | Condition or DiagnosticReport | conclusion / code |
| SR Finding (measurement) | Observation | valueQuantity |
FHIR ImagingStudy resource will be created with status: "available" and linked to the patient via subject reference. SR findings feed into the existing Clinical Context Agent flow, which generates Condition and Observation resources.
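As an illustration of the mapping table, the sketch below assembles an ImagingStudy as a plain dict from the dicom_metadata keys shown in the schema above. It is not a validated resource and not the Clinical Context Agent's actual output: build_imaging_study is a hypothetical name, and series_uid is an extra parameter added here because FHIR R4 requires series.uid, which the table does not map. The laterality value is passed through as the raw DICOM code, following the table.

```python
from typing import Any


def build_imaging_study(
    metadata: dict[str, Any], patient_ref: str, series_uid: str
) -> dict[str, Any]:
    """Assemble a minimal FHIR ImagingStudy dict from extracted DICOM metadata."""
    series: dict[str, Any] = {
        "uid": series_uid,
        "modality": {
            # DICOM controlled terminology system URI
            "system": "http://dicom.nema.org/resources/ontology/DCM",
            "code": metadata["modality"],
        },
    }
    # Optional tags are only emitted when present (missing tags stay absent).
    if metadata.get("body_part_examined"):
        series["bodySite"] = {"display": metadata["body_part_examined"]}
    if metadata.get("laterality"):
        series["laterality"] = {"code": metadata["laterality"]}

    study: dict[str, Any] = {
        "resourceType": "ImagingStudy",
        "status": "available",
        "subject": {"reference": patient_ref},
        "series": [series],
    }
    if metadata.get("study_date"):
        study["started"] = metadata["study_date"]
    if metadata.get("study_description"):
        study["description"] = metadata["study_description"]
    return study
```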
6. De-Identification: Safe Harbor DICOM Tag Stripping
Tags to Strip (DICOM PS3.15 Table E.1-1, Safe Harbor subset)
These tags contain direct patient identifiers and must be removed or replaced with "DEIDENTIFIED" before storage or forwarding.
| Tag | Name | Action |
|---|---|---|
| (0010,0010) | PatientName | Replace with "DEIDENTIFIED" |
| (0010,0020) | PatientID | Replace with "DEIDENTIFIED" |
| (0010,0030) | PatientBirthDate | Remove |
| (0010,0040) | PatientSex | Preserve (clinically relevant, not identifying alone) |
| (0010,1000) | OtherPatientIDs | Remove |
| (0010,1001) | OtherPatientNames | Remove |
| (0010,1010) | PatientAge | Preserve (clinically relevant; Safe Harbor requires ages over 89 to be aggregated into a single 90+ category) |
| (0010,1020) | PatientSize | Preserve (height, clinically relevant) |
| (0010,1030) | PatientWeight | Preserve (clinically relevant) |
| (0010,21B0) | AdditionalPatientHistory | Remove (may contain narrative PII) |
| (0008,0050) | AccessionNumber | Replace with "DEIDENTIFIED" |
| (0008,0080) | InstitutionName | Replace with "DEIDENTIFIED" |
| (0008,0081) | InstitutionAddress | Remove |
| (0008,0090) | ReferringPhysicianName | Replace with "DEIDENTIFIED" |
| (0008,1048) | PhysiciansOfRecord | Remove |
| (0008,1050) | PerformingPhysicianName | Remove |
| (0008,1060) | NameOfPhysiciansReadingStudy | Remove |
| (0008,1070) | OperatorsName | Remove |
| (0010,0050) | PatientInsurancePlanCode | Remove |
| (0010,2154) | PatientTelephoneNumbers | Remove |
| (0010,2160) | EthnicGroup | Remove |
| (0010,21F0) | PatientReligiousPreference | Remove |
| (0020,000D) | StudyInstanceUID | Replace with generated UID |
| (0020,000E) | SeriesInstanceUID | Replace with generated UID |
| (0008,0018) | SOPInstanceUID | Replace with generated UID |
| (0040,A123) | PersonName (in SR) | Replace with "DEIDENTIFIED" |
| (0032,1032) | RequestingPhysician | Remove |
| (0032,1060) | RequestedProcedureDescription | Preserve (clinically relevant) |
Tags to Preserve (Clinical Value)
All of these stay intact -- they carry clinical meaning without identifying the patient:
- Modality (0008,0060)
- BodyPartExamined (0018,0015)
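- StudyDate (0008,0020) -- Note: study date is preserved because knowing when the imaging was done is critical for procedure requirement matching (validity windows). Caveat: HIPAA Safe Harbor lists all date elements more specific than the year as identifiers when they relate to an individual, so keeping the full study date goes beyond strict Safe Harbor and needs explicit compliance sign-off (alternatives: year-only or consistent date shifting).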
- StudyDescription (0008,1030)
- SeriesDescription (0008,103E)
- Laterality (0020,0060)
- ImageLaterality (0020,0062)
- All pixel data tags
- All acquisition parameter tags (SliceThickness, PixelSpacing, etc.)
- Manufacturer, StationName, SoftwareVersions
Implementation Notes
- Use pydicom's Dataset.walk() to traverse all sequences (including nested SR content)
- After stripping, call ds.save_as() to produce clean bytes
- Generate new UIDs using pydicom.uid.generate_uid() (maintains DICOM validity)
- Log an audit event: {"event_type": "DICOM_DEIDENTIFIED", "tags_stripped": N, "document_id": "..."}
- Never log the stripped values themselves
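The tag table and the notes above amount to a keyword-to-action lookup applied to a copy of the dataset. The sketch below shows that shape on a plain dict standing in for a pydicom Dataset, so it runs without pydicom; ACTIONS covers only a few rows of the table, and the function signature (including the injected new_uid callable) is an assumption for illustration. A real implementation would also recurse into sequences, as the first note says.

```python
from typing import Any, Callable

PLACEHOLDER = "DEIDENTIFIED"

# Illustrative subset of the strip table; keys are DICOM keywords.
ACTIONS: dict[str, str] = {
    "PatientName": "replace",
    "PatientID": "replace",
    "PatientBirthDate": "remove",
    "AccessionNumber": "replace",
    "InstitutionAddress": "remove",
    "StudyInstanceUID": "new_uid",
}


def deidentify(
    tags: dict[str, Any], new_uid: Callable[[], str]
) -> tuple[dict[str, Any], int]:
    """Return a de-identified copy plus the count of tags changed; input is untouched."""
    clean: dict[str, Any] = {}
    stripped = 0
    for keyword, value in tags.items():
        action = ACTIONS.get(keyword, "keep")
        if action == "keep":
            clean[keyword] = value
            continue
        stripped += 1
        if action == "replace":
            clean[keyword] = PLACEHOLDER
        elif action == "new_uid":
            clean[keyword] = new_uid()
        # "remove": the key is simply not copied over
    return clean, stripped
```

Only the count is returned for the audit event, never the stripped values, which matches the last note above.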
7. DICOM-SR Parsing
Supported SR Template Types (Phase 1)
| SR Type | SOP Class UID | IOD | Priority |
|---|---|---|---|
| Basic Text SR | 1.2.840.10008.5.1.4.1.1.88.11 | Comprehensive | Must have |
| Enhanced SR | 1.2.840.10008.5.1.4.1.1.88.22 | Comprehensive | Must have |
| Comprehensive SR | 1.2.840.10008.5.1.4.1.1.88.33 | Comprehensive | Must have |
| Comprehensive 3D SR | 1.2.840.10008.5.1.4.1.1.88.34 | Comprehensive | Nice to have |
| Key Object Selection | 1.2.840.10008.5.1.4.1.1.88.59 | N/A | Skip (no clinical text) |
SR Content Tree Traversal
DICOM-SR stores findings as a tree of Content Items. Each item has a ValueType and a ConceptNameCodeSequence. The parser walks this tree recursively:
ContentSequence
+-- Container: "Imaging Report"
+-- Container: "Findings"
| +-- TEXT: "Moderate tricompartmental osteoarthritis..."
| +-- NUM: 2.3 mm (meniscus tear measurement)
| +-- CODE: SNOMED 396230008 (osteoarthritis)
+-- Container: "Impression"
+-- TEXT: "Near-complete loss of medial compartment cartilage"
Value Type Handling
| ValueType | Extraction | Maps To |
|---|---|---|
| TEXT | Direct string extraction | Finding text, impression, recommendation |
| NUM | Value + MeasurementUnitsCodeSequence | Observation with valueQuantity |
| CODE | CodingSchemeDesignator + CodeValue + CodeMeaning | Condition code (ICD/SNOMED if present) |
| PNAME | De-identify, do not extract | Stripped |
| DATE | Parse as ISO date | Finding date |
| TIME | Parse as ISO time | Finding time |
| UIDREF | De-identify | Stripped |
| COMPOSITE | Reference to another DICOM object | Log reference, do not follow |
| IMAGE | Reference to image frame | Log reference, do not follow |
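The traversal and the value-type rules can be sketched as a recursive generator. To stay runnable without pydicom, the tree here is modelled as nested dicts with hypothetical keys (value_type, concept_name, value, unit, children); in the real parser these would come from ValueType, ConceptNameCodeSequence and ContentSequence on each content item. The sketch only descends into CONTAINER items, which is a simplification, since other content items may also carry children.

```python
from typing import Any, Iterator

# Value types whose content is extracted; the rest are stripped or only logged.
EXTRACTED = {"TEXT", "NUM", "CODE", "DATE", "TIME"}


def walk_sr(item: dict[str, Any], path: tuple[str, ...] = ()) -> Iterator[dict[str, Any]]:
    """Depth-first walk yielding extractable items with their container path."""
    value_type = item.get("value_type")
    name = item.get("concept_name", "")
    if value_type == "CONTAINER":
        # Descend, recording this container's name for later classification.
        for child in item.get("children", []):
            yield from walk_sr(child, path + (name,))
    elif value_type in EXTRACTED:
        yield {
            "concept_name": name,
            "value_type": value_type,
            "value": item.get("value"),
            "unit": item.get("unit"),
            "path": path,
        }
    # PNAME, UIDREF, COMPOSITE, IMAGE: intentionally never emitted.
```

Keeping the container path on each item is a design choice that lets the classification step work from the item alone, without a second pass over the tree.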
Finding Classification
SR findings are classified into types for downstream processing:
| Finding Type | Heuristic | FHIR Mapping |
|---|---|---|
| impression | Under "Impression" or "Conclusion" container | DiagnosticReport.conclusion |
| finding | Under "Findings" container | Condition or Observation |
| measurement | NUM value type with unit | Observation.valueQuantity |
| recommendation | Under "Recommendation" container | CarePlan.activity (future) |
| coded_diagnosis | CODE value type with SNOMED/ICD | Condition.code |
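The heuristics in the classification table could be expressed as a small function over a finding's value type and container path. The table does not say which rule wins when several apply, so the precedence below (value-type rules first, then the innermost matching container) is an assumption, as are the function name and the set of coding scheme designators treated as diagnosis vocabularies.

```python
from typing import Optional, Sequence

# Container names (lowercased) mapped to finding types, per the table.
CONTAINER_TYPES = {
    "impression": "impression",
    "conclusion": "impression",
    "findings": "finding",
    "recommendation": "recommendation",
    "recommendations": "recommendation",
}

# Assumed designators for SNOMED CT (current and legacy) and ICD-10.
DIAGNOSIS_SCHEMES = {"SCT", "SRT", "I10"}


def classify_finding(
    value_type: str,
    path: Sequence[str],
    unit: Optional[str] = None,
    coding_scheme: Optional[str] = None,
) -> Optional[str]:
    """Return the finding type, or None when no heuristic applies."""
    if value_type == "NUM" and unit:
        return "measurement"
    if value_type == "CODE" and coding_scheme in DIAGNOSIS_SCHEMES:
        return "coded_diagnosis"
    # Otherwise the nearest enclosing known container decides.
    for container in reversed(path):
        label = CONTAINER_TYPES.get(container.strip().lower())
        if label:
            return label
    return None
```

Returning None for unknown containers leaves room to log the item as unclassified while still keeping its extracted text.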
Fallback for Non-SR DICOM
Most DICOM files patients upload will be image instances, not SR. When no SR content is found:
1. Build text from metadata: "{Modality} of {BodyPartExamined}, {Laterality}, Study Date: {StudyDate}. {StudyDescription}"
2. Pass this text to Clinical Context Agent as report_type="imaging"
3. The agent will not extract diagnoses from metadata alone (correct behavior -- a knee MRI file does not contain findings, the radiology report does)
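Step 1 of the fallback can be sketched as follows, using the dicom_metadata keys from Section 5. The function name build_metadata_text and the handling of missing tags (dropping the corresponding fragment rather than printing "None") are assumptions; the template itself is the one given in step 1.

```python
from typing import Any


def build_metadata_text(metadata: dict[str, Any]) -> str:
    """Render '{Modality} of {BodyPart}, {Laterality}, Study Date: ... {Description}'."""
    modality = metadata.get("modality") or "Imaging"
    body_part = metadata.get("body_part_examined")
    # Without a body part, fall back to a generic phrase.
    parts = [f"{modality} of {body_part}" if body_part else f"{modality} study"]
    if metadata.get("laterality"):
        parts.append(str(metadata["laterality"]))
    if metadata.get("study_date"):
        parts.append(f"Study Date: {metadata['study_date']}")
    text = ", ".join(parts) + "."
    if metadata.get("study_description"):
        text += f" {metadata['study_description']}"
    return text
```

For the example metadata in Section 5 this would produce "MR of KNEE, L, Study Date: 2026-03-15. MRI LEFT KNEE WITHOUT CONTRAST".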
8. Frontend Changes
File Input
ConversationApp.tsx -- update the accept attribute:
Current: accept=".pdf,.jpg,.jpeg,.png,.doc,.docx"
New: accept=".pdf,.jpg,.jpeg,.png,.doc,.docx,.dcm,.dicom"
Frontend File Validation
Update the client-side extension check to include .dcm and .dicom. DICOM files can be large (50-200MB for multi-frame), but Phase 1 only supports single-frame files. Keep the 20MB limit for now; revisit in Phase 2 for series support.
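For reference, the same checks are sketched below in Python, as they might appear in the backend validator; the client-side check would mirror the extension and size rules. The function names are hypothetical, and interpreting 20MB as 20 * 1024 * 1024 bytes is an assumption. The "DICM" marker at byte offset 128 is part of the DICOM file format, but some legacy files omit the preamble, so it is best treated as a heuristic and the parser remains the final arbiter.

```python
ALLOWED_DICOM_EXTENSIONS = (".dcm", ".dicom")
MAX_UPLOAD_BYTES = 20 * 1024 * 1024  # assumed reading of the 20MB limit


def has_dicom_extension(filename: str) -> bool:
    """Case-insensitive extension check."""
    return filename.lower().endswith(ALLOWED_DICOM_EXTENSIONS)


def looks_like_dicom(header: bytes) -> bool:
    """True when the 128-byte preamble is followed by the 'DICM' marker."""
    return len(header) >= 132 and header[128:132] == b"DICM"


def validate_dicom_upload(filename: str, size_bytes: int, header: bytes) -> list[str]:
    """Return a list of human-readable problems; empty means the file may proceed."""
    errors: list[str] = []
    if not has_dicom_extension(filename):
        errors.append("unsupported extension")
    if size_bytes > MAX_UPLOAD_BYTES:
        errors.append("file exceeds 20MB limit")
    if not looks_like_dicom(header):
        errors.append("missing DICM marker")
    return errors
```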
File Attachment Card
When file_type === "dicom", show:
- A medical imaging icon (Lucide ScanLine or FileImage) instead of the generic file icon
- Badge: "DICOM" in teal
- After processing: body part + modality label (e.g., "MR - Left Knee")
EHR Drawer / Documents Tab
When a document has extracted_data.thumbnail_key:
- Fetch thumbnail via presigned URL
- Display as a small preview (128x128) in the document row
- Click to expand to 256x256 in a lightbox
- Overlay: modality + body part + study date
When no thumbnail (e.g., DICOM-SR without pixel data):
- Show generic imaging icon
- Display metadata summary: modality, body part, study date
9. Testing Plan
Unit Tests (tests/test_dicom_parser.py)
| Test | Description | Assertion |
|---|---|---|
| test_parse_valid_dicom | Parse a synthetic DICOM file | Returns metadata with body_part, modality, study_date |
| test_parse_dicom_missing_tags | DICOM with minimal tags | Graceful None values, no crash |
| test_parse_non_dicom_file | Pass a PDF to the parser | Raises DicomParseError, does not crash |
| test_deidentify_strips_patient_name | Check PatientName tag | Replaced with "DEIDENTIFIED" |
| test_deidentify_strips_patient_id | Check PatientID tag | Replaced with "DEIDENTIFIED" |
| test_deidentify_preserves_modality | Check clinical tags | Modality, BodyPartExamined unchanged |
| test_deidentify_preserves_study_date | Check StudyDate | Preserved |
| test_deidentify_replaces_uids | Check instance UIDs | New UIDs generated, old ones gone |
| test_deidentify_nested_sr | SR with PersonName in content | PersonName in ContentSequence stripped |
| test_sr_extraction_basic_text | Basic Text SR | Extracts TEXT findings with concept names |
| test_sr_extraction_measurements | SR with NUM values | Extracts value + unit |
| test_sr_extraction_coded_diagnosis | SR with CODE values | Extracts SNOMED/ICD codes |
| test_sr_no_content_sequence | Image DICOM, no SR | Returns empty findings, no crash |
| test_thumbnail_generation | DICOM with pixel data | Returns PNG bytes, dimensions <= 256x256 |
| test_thumbnail_no_pixel_data | DICOM-SR or metadata-only | Returns None |
| test_build_text_representation | Metadata + findings | Produces coherent text string for the Clinical Context Agent |
| test_body_part_modality_matching | KNEE + MR vs TKR requirements | Returns match with 0.95+ confidence |
Integration Tests (tests/test_dicom_pipeline.py)
| Test | Description |
|---|---|
| test_dicom_upload_confirm_flow | Upload .dcm via presign, confirm, verify extracted_data populated |
| test_dicom_skips_ocr | Upload .dcm, verify OCR not attempted, ocr_method="dicom_metadata" |
| test_dicom_deidentified_in_r2 | Upload .dcm, download from R2, verify PatientName absent |
| test_dicom_thumbnail_stored | Upload .dcm with pixel data, verify _thumb.png key exists in R2 |
| test_dicom_sr_feeds_clinical_context | Upload DICOM-SR, verify Clinical Context Agent produces FHIR resources |
| test_dicom_requirement_matching | Upload knee MRI DICOM for TKR case, verify requirement auto-matched |
| test_dicom_feature_flag_off | Disable dicom_support_enabled, upload .dcm, verify rejection |
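Test Fixtures
Create synthetic DICOM files using pydicom (no real patient data):
# tests/conftest.py
from io import BytesIO
import pytest
from pydicom.dataset import Dataset, FileMetaDataset
from pydicom.uid import ExplicitVRLittleEndian, generate_uid

@pytest.fixture
def sample_dicom_bytes() -> bytes:
    # Synthetic single-frame MR image, returned as DICOM Part 10 bytes
    ds = Dataset()
    ds.SOPClassUID = "1.2.840.10008.5.1.4.1.1.4"  # MR Image Storage
    ds.SOPInstanceUID = generate_uid()
    ds.PatientName = "TEST^PATIENT"
    ds.PatientID = "TEST123"
    ds.Modality = "MR"
    ds.BodyPartExamined = "KNEE"
    ds.Laterality = "L"
    ds.StudyDate = "20260315"
    ds.StudyDescription = "MRI LEFT KNEE W/O CONTRAST"
    # Image pixel attributes: 64x64, 16-bit unsigned grayscale
    ds.SamplesPerPixel = 1
    ds.PhotometricInterpretation = "MONOCHROME2"
    ds.Rows = 64
    ds.Columns = 64
    ds.BitsAllocated = 16
    ds.BitsStored = 16
    ds.HighBit = 15
    ds.PixelRepresentation = 0
    ds.PixelData = b"\x00" * (64 * 64 * 2)
    # Required file meta (group 0002) and transfer syntax
    ds.file_meta = FileMetaDataset()
    ds.file_meta.MediaStorageSOPClassUID = ds.SOPClassUID
    ds.file_meta.MediaStorageSOPInstanceUID = ds.SOPInstanceUID
    ds.file_meta.TransferSyntaxUID = ExplicitVRLittleEndian
    ds.is_little_endian = True
    ds.is_implicit_VR = False
    buffer = BytesIO()
    ds.save_as(buffer, write_like_original=False)  # pydicom 2.x: writes preamble + "DICM"
    return buffer.getvalue()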
10. Risks and Mitigations
| Risk | Severity | Likelihood | Mitigation |
|---|---|---|---|
| Large DICOM files (>20MB) | Medium | High | Phase 1 keeps the 20MB limit. Multi-frame/series support in Phase 2 with chunked upload. Warn user at frontend if file too large. |
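| Compressed pixel data | Medium | Medium | DICOM supports JPEG, JPEG 2000, JPEG-LS, and RLE compression. pydicom decodes RLE itself (with NumPy); with Pillow installed it can decode baseline JPEG and, if Pillow was built with OpenJPEG, JPEG 2000. JPEG-LS and lossless JPEG need pylibjpeg or GDCM (see Section 11). If decompression fails, skip thumbnail but still extract metadata. |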
| DICOM files without standard tags | Low | Medium | Graceful degradation: missing tags return None, parser still succeeds. Minimum viable: if file parses as DICOM at all, accept it. |
| PII leak via pixel data | High | Low | Burned-in patient annotations on images (name/MRN in pixel data). Phase 1 does NOT scrub pixel data (that requires OCR on the image itself). Risk accepted for MVP; flagged for Phase 2. |
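| Malicious DICOM files | High | Low | pydicom is a parser, not an executor. Enforce the upload size limit before parsing, and bound decode memory to 512MB by checking the declared pixel dimensions (Rows x Columns x frames x bytes per sample) before accessing pixel data; dcmread's defer_size option can defer loading of large elements. Reject files that fail to parse. |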
| DICOM-SR without standard templates | Low | Medium | Non-standard SR trees are walked generically. Unknown containers logged but not classified. TEXT values always extracted regardless of container name. |
| Railway memory limits | Medium | Low | pydicom loads entire file into memory. 20MB DICOM is manageable. Phase 2 multi-frame series (500MB+) would need streaming -- out of scope. |
| R2 overwrite race condition | Low | Low | De-identified file overwrites original atomically via R2 PUT. If the write fails, original stays (acceptable -- retry will deidentify again). |
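For the "Malicious DICOM files" and memory-limit rows, one way to bound decoded pixel memory is to estimate the uncompressed size from the header tags before touching the pixel data. The sketch below is standard-library arithmetic only; the function names are hypothetical, and the 512MB default simply reuses the figure from the table.

```python
DEFAULT_PIXEL_BUDGET = 512 * 1024 * 1024  # 512MB, figure taken from the risk table


def estimated_pixel_bytes(
    rows: int,
    columns: int,
    bits_allocated: int,
    samples_per_pixel: int = 1,
    number_of_frames: int = 1,
) -> int:
    """Uncompressed size implied by the image header tags."""
    bytes_per_sample = (bits_allocated + 7) // 8  # round up to whole bytes
    return rows * columns * samples_per_pixel * number_of_frames * bytes_per_sample


def within_pixel_budget(
    rows: int,
    columns: int,
    bits_allocated: int,
    samples_per_pixel: int = 1,
    number_of_frames: int = 1,
    budget_bytes: int = DEFAULT_PIXEL_BUDGET,
) -> bool:
    """True when decoding the declared image would stay within the budget."""
    size = estimated_pixel_bytes(
        rows, columns, bits_allocated, samples_per_pixel, number_of_frames
    )
    return 0 < size <= budget_bytes
```

Because the estimate uses declared dimensions rather than file size, it also catches small compressed files that would expand to a very large frame.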
11. Dependencies
Python Packages
| Package | Version | Purpose | Size |
|---|---|---|---|
| pydicom | >=2.4.0,<3.0 | DICOM file parsing, tag access, SR traversal, UID generation (decoding pixel data via pixel_array also requires NumPy) | ~8MB |
| Pillow | >=10.0.0,<11.0 | Pixel data to PNG thumbnail conversion | ~3MB (likely already installed) |
Optional (Not Required for Phase 1)
| Package | Purpose | When |
|---|---|---|
| pylibjpeg + pylibjpeg-libjpeg | JPEG compressed transfer syntax decompression | If users upload JPEG-compressed DICOM |
| pylibjpeg-openjpeg | JPEG2000 compressed transfer syntax | If users upload J2K-compressed DICOM |
| highdicom | Higher-level SR parsing, Measurement Report templates | Phase 2 if SR parsing needs more sophistication |
| gdcm | Handles exotic transfer syntaxes | Post-MVP if decompression failures are common |
No New Infrastructure
- No new database tables (uses existing document_references.extracted_data JSONB)
- No new R2 buckets (thumbnails stored alongside documents)
- No new services or containers
- No new external API calls
Appendix A: DICOM Modality Codes to Procedure Requirement Mapping
| Modality Code | Modality Name | Typical Procedure Match |
|---|---|---|
| MR | Magnetic Resonance | Knee MRI, Brain MRI, Spine MRI |
| CT | Computed Tomography | CT Angiogram, CT Scan |
| CR / DX | Computed/Digital Radiography | X-Ray, Chest X-Ray |
| US | Ultrasound | Cardiac Echo, Abdominal Ultrasound |
| NM | Nuclear Medicine | Bone Scan, PET Scan |
| PT | PET | PET Scan |
| XA | X-Ray Angiography | Cardiac Catheterization |
| MG | Mammography | Mammogram |
| ECG | Electrocardiography | ECG/EKG |
| SR | Structured Report | Radiology Report (text, not imaging) |
Appendix B: Body Part to Procedure Code Mapping
| BodyPartExamined | Procedure Codes |
|---|---|
| KNEE | knee_replacement (27447), acl_reconstruction |
| HIP | hip_replacement (27130) |
| SPINE / LSPINE / CSPINE | spinal_fusion, laminectomy |
| HEART / CHEST | cabg (33533), valve_replacement |
| SHOULDER | shoulder_replacement, rotator_cuff |
| ABDOMEN | bariatric (43775), hernia_repair |
| BRAIN / HEAD | craniotomy |
| BREAST | mastectomy |
These mappings are used by the deterministic matcher in match_dicom_metadata_to_requirements() and will be stored in config/dicom_mappings.yaml for configurability.
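A minimal sketch of how such a deterministic matcher might look is given below. Only the function name and its (metadata, proc_reqs) -> list[dict] shape come from the checklist; the requirement record fields (id, description), the MODALITY_TERMS vocabulary, and the returned keys are assumptions for illustration, and the real term lists would be loaded from config/dicom_mappings.yaml.

```python
import re
from typing import Any

# Assumed vocabulary linking modality codes to words used in requirement text.
MODALITY_TERMS: dict[str, tuple[str, ...]] = {
    "MR": ("mri",),
    "CT": ("ct",),
    "CR": ("x-ray",),
    "DX": ("x-ray",),
    "US": ("ultrasound", "echo"),
}


def match_dicom_metadata_to_requirements(
    metadata: dict[str, Any], proc_reqs: list[dict[str, Any]]
) -> list[dict[str, Any]]:
    """Match BodyPartExamined + Modality against requirement descriptions."""
    body = (metadata.get("body_part_examined") or "").lower()
    terms = MODALITY_TERMS.get(metadata.get("modality") or "", ())
    if not body or not terms:
        return []  # without both tags, leave matching to the LLM path
    matches = []
    for req in proc_reqs:
        # Whole-word comparison avoids "ct" matching inside other words.
        words = set(re.findall(r"[a-z0-9-]+", req["description"].lower()))
        if body in words and any(term in words for term in terms):
            matches.append(
                {"requirement_id": req["id"], "confidence": 0.95, "method": "dicom_metadata"}
            )
    return matches
```

Under these assumptions, a KNEE + MR upload would match a "Knee MRI" requirement but not "Chest X-Ray", and a KNEE + CT upload would match neither.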