CARE: Towards Clinical Accountability
in Multi-Modal Medical Reasoning
with an Evidence-Grounded Agentic Framework

1Microsoft Research Asia   
2Department of Biomedical Engineering, Yale University   
3Department of Radiology & Biomedical Imaging, Yale University
*Corresponding Author

ICLR 2026
CARE teaser figure comparing different VLM approaches for medical reasoning: (a) single-shot VLMs, (b) grounding VLMs, (c) generalist visual reasoning VLMs, and (d) our agentic CARE-Coord framework, alongside (e) accuracy vs. model size comparison.

CARE decomposes medical reasoning into coordinated expert submodules—entity proposal, referring segmentation, and evidence-grounded VQA—aligned with the clinical diagnostic workflow. CARE-Coord outperforms the heavily trained state-of-the-art (Lingshu-32B) by 5.2% on average across four medical VQA benchmarks.

Abstract

Large vision-language models (VLMs) have shown strong multi-modal medical reasoning ability, but most operate as end-to-end black boxes, diverging from clinicians' evidence-based, staged workflows and hindering clinical accountability. Complementary to this, expert visual grounding models can accurately localize regions of interest (ROIs), providing explicit, reliable evidence that improves both reasoning accuracy and trust.

In this paper, we introduce CARE, advancing Clinical Accountability in multi-modal medical Reasoning with an Evidence-grounded agentic framework. Unlike existing approaches that couple grounding and reasoning within a single generalist model, CARE decomposes the task into coordinated submodules to reduce shortcut learning and hallucination: a compact VLM proposes relevant medical entities; an expert entity-referring segmentation model produces pixel-level ROI evidence; and a grounded VLM reasons over the full image augmented by ROI hints. The VLMs are optimized with reinforcement learning with verifiable rewards to align answers with supporting evidence. Furthermore, a VLM coordinator plans tool invocation and reviews evidence-answer consistency, providing agentic control and final verification.

Evaluated on standard medical VQA benchmarks, our CARE-Flow (coordinator-free) improves average accuracy by 10.9% over the same-size (10B) state-of-the-art (SOTA). With dynamic planning and reviewing, our CARE-Coord yields a further gain, outperforming the heavily trained SOTA by 5.2%. Our experiments demonstrate that an agentic framework that emulates clinical workflows, incorporating decoupled specialists and explicit evidence, yields more accurate and accountable medical AI.

Method

CARE method overview: the framework comprises a VLM coordinator and a set of task-specific expert models including entity proposal VLM, referring segmentation model, and evidence-grounded VQA VLM.

CARE decomposes multi-modal medical reasoning into specialized sub-tasks and integrates expert visual tools with agentic coordination, aligning the pipeline with clinical practice. Given a user question and a medical image, CARE executes three stages:

Stage 1: Medical Entity Proposal

A question-conditioned, compact VLM proposes relevant anatomical structures or findings (e.g., organs, lesions, devices). The VLM is fine-tuned with reinforcement learning using a verifiable, embedding-similarity reward for evidence-consistent proposals.
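The verifiable reward for Stage 1 can be illustrated as an embedding-similarity score between proposed and reference entities. This is a minimal sketch under stated assumptions: the hash-based `embed` below is a deterministic stand-in (the paper's actual biomedical text encoder is not specified here), and the max-over-references credit assignment is one plausible choice, not necessarily the paper's.

```python
import zlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in text embedding: a deterministic, unit-normalized random
    vector seeded by the string. (Assumption: the real system would use
    a biomedical text encoder instead.)"""
    rng = np.random.default_rng(zlib.crc32(text.lower().encode()))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def entity_reward(proposed: list[str], reference: list[str]) -> float:
    """Embedding-similarity reward: each proposed entity is credited by
    its best-matching reference entity; the reward is the mean credit,
    clipped to [0, 1]."""
    if not proposed:
        return 0.0
    P = np.stack([embed(e) for e in proposed])   # (n_proposed, dim)
    R = np.stack([embed(e) for e in reference])  # (n_reference, dim)
    sims = P @ R.T  # cosine similarities (vectors are unit-normalized)
    return float(np.clip(sims.max(axis=1), 0.0, 1.0).mean())
```

An exact match (e.g., proposing "liver" when the reference contains "liver") scores near 1.0, while unrelated proposals score near 0, giving the RL loop a dense, verifiable signal.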

Stage 2: Entity Referring Segmentation

A tailored referring-segmentation model, built on SA-Med-2D with a biomedical text encoder, localizes the proposed entities and produces pixel-level ROI evidence. A confidence score filters out low-quality segmentations to prevent noisy evidence from harming downstream reasoning.
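The confidence gate in Stage 2 can be sketched as a simple threshold filter over candidate masks. The threshold value `tau=0.5` and the fall-back to whole-image ("global") reasoning when every mask is rejected are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def filter_rois(masks, confidences, tau: float = 0.5):
    """Keep only ROI masks whose segmentation confidence passes tau.

    masks: list of HxW boolean arrays; confidences: per-mask scores in [0, 1].
    Returns (kept_masks, mode), where mode is "roi" if any mask survives
    and "global" otherwise (assumption: fall back to full-image reasoning).
    """
    kept = [m for m, c in zip(masks, confidences) if c >= tau]
    if not kept:
        return [], "global"
    return kept, "roi"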

Stage 3: Evidence-Grounded VQA

A fine-tuned VQA model reasons over the full image augmented by one of three evidence views reflecting clinical practice: (i) a zoom-in crop for local detail, (ii) a binary mask for positional/spatial priors, or (iii) a global indicator when local evidence is unnecessary.
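The three evidence views can be sketched as simple image transforms over a grayscale array. The padding size and the exact rendering of each view are illustrative assumptions; the paper's preprocessing details are not reproduced here.

```python
import numpy as np

def zoom_crop(image: np.ndarray, mask: np.ndarray, pad: int = 8) -> np.ndarray:
    """View (i): crop the image to the ROI's bounding box plus a margin."""
    ys, xs = np.nonzero(mask)
    y0, y1 = max(ys.min() - pad, 0), min(ys.max() + pad + 1, image.shape[0])
    x0, x1 = max(xs.min() - pad, 0), min(xs.max() + pad + 1, image.shape[1])
    return image[y0:y1, x0:x1]

def build_evidence(image: np.ndarray, mask: np.ndarray, view: str) -> np.ndarray:
    """Render one of the three evidence views described above."""
    if view == "zoom":
        return zoom_crop(image, mask)          # (i) local detail
    if view == "mask":
        return mask.astype(np.uint8) * 255     # (ii) positional/spatial prior
    return image                               # (iii) global: no local evidence
```

In the full system the chosen view is passed to the grounded VQA model alongside the original image, rather than replacing it.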

CARE-Flow executes a static pipeline through all three stages with majority voting. CARE-Coord adds a dynamic VLM coordinator that plans tool invocations, selects the most informative evidence view, and performs iterative chain-of-thought review to verify reasoning quality and mitigate hallucinations.
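The control flow of the two variants can be sketched as follows. The callables `run_vqa` and `review` are hypothetical stand-ins for the grounded VQA model and the coordinator's consistency check, and voting once per evidence view is one plausible reading of the majority-voting step.

```python
from collections import Counter

def care_flow_answer(question, image, run_vqa,
                     views=("zoom", "mask", "global")):
    """CARE-Flow-style static pipeline: query the VQA model once per
    evidence view, then majority-vote the answers.
    (`run_vqa(question, image, view)` is a hypothetical callable.)"""
    answers = [run_vqa(question, image, v) for v in views]
    winner, _ = Counter(answers).most_common(1)[0]
    return winner

def care_coord_answer(question, image, run_vqa, review, max_rounds=3):
    """CARE-Coord-style loop: answer under a chosen view, have the
    coordinator review evidence-answer consistency, and retry with
    another view on failure. (`review` is a hypothetical checker.)"""
    ans = None
    for view in ("zoom", "mask", "global")[:max_rounds]:
        ans = run_vqa(question, image, view)
        if review(question, view, ans):
            return ans
    return ans  # fall back to the last answer if no round passes review
```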

Results

Overall Performance Comparison

We evaluate CARE on four standard medical VQA benchmarks: OmniMedVQA, VQA-RAD, SLAKE (in-domain), and VQA-Med-2019 (out-of-domain), spanning over ten image modalities and multiple organs. Open-ended questions are scored by GPT-4o against ground-truth answers.

Quantitative Results on Medical VQA Benchmarks

Model                      OMVQA-3k   VQA-RAD   SLAKE   VQA-Med-2019   Overall
Proprietary
GPT-4o                       64.07      58.54    63.55      59.60        61.44
GPT-5                        74.73      63.19    67.75      62.20        66.97
Open-Source General VLMs
Llama-3.2-11B-Vision         43.10      53.22    63.17      57.40        54.22
Qwen2.5-VL-7B                61.40      54.10    59.73      50.60        56.46
Qwen2.5-VL-32B               65.10      61.20    65.46      51.60        60.84
InternVL3-8B                 75.97      61.86    66.13      57.40        65.34
InternVL3-38B                78.57      62.97    68.70      58.80        67.26
DeepEyes-7B                  57.40      56.10    61.16      52.20        56.72
Medical Expert VLMs
LLaVA-Med-7B                 45.30      41.91    50.86      37.00        43.77
MedVLM-R1-2B                 72.07      41.46    46.47      45.40        51.35
medgemma-4b                  61.50      58.09    69.66      47.40        59.16
medgemma-27b                 64.23      62.75    70.52      48.40        61.47
HuatuoGPT-Vision-7B          70.70      59.87    60.50      57.20        62.07
HuatuoGPT-Vision-34B         76.80      60.75    64.12      60.60        65.57
Lingshu-7B                   73.17      58.54    76.15      58.80        66.66
Lingshu-32B                  83.97      64.75    82.25      58.20        72.29
Ours
CARE-Flow-S (4B)             94.53      56.32    78.44      53.60        70.72
CARE-Coord-S                 97.70      62.75    77.19      60.60        74.56
CARE-Flow-B (10B)            96.17      63.64    83.21      56.60        74.91
CARE-Coord-B                 97.97      68.29    83.11      60.80        77.54

Bold = best, underline = second best. Medical expert VLMs are shown with gray background; our methods are in green.

See more ablation experiments and analysis in our paper.

Qualitative Examples

We present case studies showing CARE-Coord's complete reasoning trace. The coordinator plans tool invocations, selects the optimal evidence view, and reviews the chain-of-thought for consistency. Key coordinator outputs are highlighted in blue, model reasoning in green, and final answers in yellow.

Conclusion

We propose CARE, a medical vision reasoning agent that mirrors the evidence-guided decision-making process of real-world clinical practice. Rather than producing a single-shot, black-box output, CARE divides medical decision-making into three interpretable stages with dedicated expert models: identify the entities of interest, localize the corresponding ROIs on the image, and reason over the local visual evidence.

Compared to existing methods, CARE not only achieves stronger performance on open benchmarks but also demonstrates greater accountability and reliability. The coordinator in CARE-Coord adds dynamic planning and iterative chain-of-thought review, further improving accuracy in both in-domain and out-of-domain settings.

Ethics Statement

This work uses only publicly available medical VQA benchmarks (OmniMedVQA, VQA-RAD, SLAKE, VQA-Med-2019) and segmentation data (SA-Med-20M); no new data were collected and no patient interaction occurred. All datasets are used under their respective licenses, with no attempt to re-identify individuals. This system is intended for research use only and must not be used for clinical diagnosis or treatment.

BibTeX

@misc{du2026care,
      title={CARE: Towards Clinical Accountability in Multi-Modal Medical Reasoning with an Evidence-Grounded Agentic Framework}, 
      author={Yuexi Du and Jinglu Wang and Shujie Liu and Nicha C. Dvornek and Yan Lu},
      year={2026},
      eprint={2603.01607},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2603.01607}, 
}