| dc.description.abstract |
Automated medical report generation (MRG) has gained significant research value for its potential to reduce workload and prevent diagnostic errors. Despite recent advances, generating accurate radiology reports remains challenging, as existing models often struggle to visually ground their outputs in clinically important regions, which is critical for practical application. We identify three key factors that make visual grounding particularly difficult in medical imaging: the deficiency of visual cues in medical images, the imbalance of disease distribution, and the inherent frequency bias of the decoder, which tends to prioritize common findings over clinically important ones. In this work, we propose RDP-MRG, a medical report generation framework that mimics the radiologist's diagnostic process. Our approach follows a coarse-to-fine diagnostic process composed of three integrated stages. First, the model localizes suspicious regions at the macro-level diagnosis stage by amplifying subtle visual cues using anatomical and clinical knowledge (Visual Cue Amplification, VCA). Second, it identifies the corresponding organ and infers associated diseases for each localized region at the micro-level diagnosis stage (Visual Cue Embodiment, VCE). Finally, the model explicitly leverages the localized and inferred diagnostic information—lesions, organs, and diseases—as guidance to generate visually grounded reports (Visually Grounded Generation, VGG). We evaluate RDP-MRG on two benchmark datasets, MIMIC-CXR and IU-Xray. On MIMIC-CXR, our method achieves superior clinical accuracy among single-stage MRG models and attains performance comparable to or even exceeding that of two-stage MRG approaches. Furthermore, RDP-MRG establishes state-of-the-art zero-shot performance on IU-Xray, demonstrating strong cross-dataset generalizability.
Extensive experimental results further confirm that our coarse-to-fine diagnostic framework effectively addresses the key challenges in medical report generation, resulting in improved visual grounding and clinical efficacy. |