Advancing Radiology with Multimodal Deep Learning: From Automated Reporting to Comprehensive Diagnostic Insights

Santomauro, Andrea

This doctoral thesis investigates deep learning methods for automated radiology report generation from medical images, with a particular focus on transformer-based architectures and their adaptation to clinically relevant tasks. The research aims to design, implement, and evaluate novel mechanisms that (1) integrate image analysis and natural language processing for free-text report generation, (2) ensure clinical accuracy and completeness, (3) handle multi-label classification in complex imaging scenarios, and (4) provide lightweight, modular solutions suitable for low-resource settings. To address these objectives, four individual studies were conducted. Study 1 examined the limitations of standard positional encoding in Vision Transformers when applied to medical imaging and proposed a similarity-based positional encoding scheme. By computing cosine similarities among image patches from their convolutional features and projecting them into the transformer’s embedding space, this method consistently improved classification accuracy, precision, recall, and F1 scores across six diverse MedMNIST datasets under both training-from-scratch and transfer-learning regimes. Study 2 focused on multi-label classification in chest radiographs by introducing dedicated class registers within a Vision Transformer architecture. Each register acted as a class-specific query, allowing the model to learn disentangled representations for up to 14 thoracic diseases on the CheXpert dataset. This register-based design achieved a mean ROC AUC of 0.81 across all labels and yielded attention maps that aligned with clinically relevant regions, enhancing model interpretability. Study 3 explored contrastive learning as an alternative paradigm for automatic report generation. A multimodal Siamese network was trained to align chest X-ray embeddings (from a fine-tuned CheXNet) with textual embeddings using a contrastive loss.
The learned joint embedding space enabled downstream free-text report generation via a BERT2BERT decoder, producing embeddings with an average L2 distance of 0.21 from reference reports, and proved robust to sparse or noisy supervision. Study 4 evaluated a streamlined, decoder-only approach by combining a visual encoder (CheXNet or ViT) with GPT-2 for report generation on MIMIC-CXR. Different positional encoding strategies were compared (absolute, sinusoidal, and uniform token assignment), and the best configuration (ViT + custom encoding + beam search) achieved a BERTScore F1 of 0.88. A human expert review by two board-certified radiologists further confirmed high clinical quality, with no errors in over 62% of reports and at most one error in over 84%. Collectively, these studies provide a broad understanding of how transformer-based models can be adapted and evaluated for medical image classification and report generation. The findings have implications for embedding structural relationships in spatial encodings, disentangling multi-label attention flows, harnessing contrastive objectives for cross-modal alignment, and designing resource-efficient architectures, offering new insights into the integration of AI in radiological practice.
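
As an illustration of the idea behind Study 1, the following is a minimal sketch of a similarity-based positional encoding, not the thesis implementation. It assumes that per-patch convolutional features are already available. The function names, the array shapes, the random projection matrix (which would be learned in practice), and the additive combination with patch embeddings are all assumptions made for illustration.

```python
import numpy as np


def cosine_similarity_matrix(patch_features):
    """Pairwise cosine similarity between patches.

    patch_features: (num_patches, feat_dim) array, assumed to hold
    convolutional features for each image patch.
    Returns a (num_patches, num_patches) matrix.
    """
    norms = np.linalg.norm(patch_features, axis=1, keepdims=True)
    unit = patch_features / np.clip(norms, 1e-8, None)  # guard against zero vectors
    return unit @ unit.T


def similarity_positional_encoding(patch_features, projection):
    """Project each patch's similarity profile into the embedding space.

    projection: (num_patches, embed_dim) matrix; hypothetical stand-in
    for a learned linear layer.
    Returns a (num_patches, embed_dim) encoding.
    """
    similarity = cosine_similarity_matrix(patch_features)
    return similarity @ projection


# Toy example with random data (illustrative shapes only).
rng = np.random.default_rng(0)
num_patches, feat_dim, embed_dim = 16, 32, 64
features = rng.normal(size=(num_patches, feat_dim))
projection = rng.normal(size=(num_patches, embed_dim)) * 0.02
patch_embeddings = rng.normal(size=(num_patches, embed_dim))

encoding = similarity_positional_encoding(features, projection)
# Assumption: the encoding is added to the patch embeddings, as with
# standard positional encodings in Vision Transformers.
tokens = patch_embeddings + encoding
```

In this sketch each patch's encoding depends on how similar its content is to every other patch rather than on its grid index alone, which is the structural relationship the abstract describes.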

Advancing Radiology with Multimodal Deep Learning: From Automated Reporting to Comprehensive Diagnostic Insights / Andrea Santomauro, 2025 Nov 21. 37th cycle