Paper-grounded figure-to-video generation

Helping Figures Tell their Story!
Paper-Grounded Video Generation Explaining Complex Scientific Figures

University of Maryland
Anonymous ARR submission 2026

*Equal contribution
Pipeline stages: 01 Narration, 02 Perception, 03 Grounding

Abstract

Scientific figures compress complex pipelines into a single canvas, yet understanding them requires paper-grounded, step-by-step narration aligned with visual highlights, a capability missing from current video generation systems and benchmarks.

To address this, we introduce paper-grounded figure-to-video generation: generating narrated, region-grounded walkthrough videos from a figure and its paper. We propose MINARD (Multimodal Interpretation of Narrated Architecture via Region Decomposition), a pipeline that generates paper-grounded narrations and sequentially grounds them to figure regions. We also release FigTalk, a benchmark with new sequential and component-level grounding metrics.

On FigTalk, MINARD generates human-like, paper-faithful narrations and outperforms existing approaches on narration-conditioned spatial grounding of figures in both automatic and human evaluation.
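
The task described in the abstract, pairing each narration sentence with a figure region and presenting the pairs in order, can be pictured with a small data structure. The sketch below is illustrative only and is not the MINARD implementation: the names `NarrationStep`, `HighlightEvent`, and `build_timeline`, the normalized box format, and the per-sentence duration are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical types for illustration; not taken from the MINARD code base.
# A box is (x0, y0, x1, y1), assumed to be normalized to the figure size.
Box = Tuple[float, float, float, float]


@dataclass
class NarrationStep:
    text: str          # one sentence of paper-grounded narration
    region: Box        # figure region to highlight while the sentence plays
    duration_s: float  # assumed speaking time for the sentence, in seconds


@dataclass
class HighlightEvent:
    start_s: float
    end_s: float
    region: Box
    text: str


def build_timeline(steps: List[NarrationStep]) -> List[HighlightEvent]:
    """Lay the steps end to end so each region is highlighted during its sentence."""
    timeline: List[HighlightEvent] = []
    clock = 0.0
    for step in steps:
        if step.duration_s <= 0:
            raise ValueError("duration_s must be positive")
        end = clock + step.duration_s
        timeline.append(HighlightEvent(clock, end, step.region, step.text))
        clock = end
    return timeline


# Made-up example input: two narration sentences over two figure regions.
steps = [
    NarrationStep("The encoder reads the input figure.", (0.05, 0.10, 0.30, 0.90), 3.0),
    NarrationStep("Its output feeds the grounding module.", (0.35, 0.10, 0.60, 0.90), 2.5),
]
timeline = build_timeline(steps)
for event in timeline:
    print(f"{event.start_s:.1f}s to {event.end_s:.1f}s  {event.region}  {event.text}")
```

Keeping the narration order explicit in the list is what makes the grounding sequential: a renderer can walk the timeline once and draw one highlight per sentence.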

Example Job Comparison

Lecture exports rendered with grounding configurations, generated video, and ground-truth highlights.

Results

Grounding Evaluation on FigTalk-Extended

Each cell reports CF, CC, and EH.

Grounding evaluation on FigTalk-Extended. Darker shades indicate stronger grounding quality.
Method        CF   CC   EH  |  CF   CC   EH  |  CF   CC   EH  |  CF   CC   EH
MINARD        .84  .79  .09 |  .79  .74  .12 |  .81  .76  .10 |  .77  .72  .14
SAM+Box       .71  .74  .28 |  .59  .61  .34 |  .69  .71  .29 |  .55  .57  .36
VLM-Grnd.     .58  .62  .34 |  .46  .51  .39 |  .55  .59  .35 |  .42  .46  .41
SlideTalker   .42  .45  .48 |  .33  .36  .53 |  .40  .43  .50 |  .30  .33  .55
TEA           .24  .27  .58 |  .17  .20  .66 |  .22  .25  .60 |  .14  .17  .69
C2V-Image     .21  .24  .61 |  .15  .18  .69 |  .19  .22  .63 |  .12  .15  .72
Veo-3.1       .18  .21  .62 |  .11  .14  .71 |  .16  .19  .65 |  .09  .12  .74
CogVideoX     .13  .16  .68 |  .08  .11  .76 |  .12  .15  .70 |  .06  .09  .79

D1 Narration Quality

FigTalk-Gold narration quality by input regime and backbone.

Color encodes within-column rank: best, strong, mid, weak, worst. With the same backbone, MINARD scores highest in every column.
Method        Input    Backbone
Figure-only   F only   Gemini     0.71  0.78  0.56  0.39  0.24
Figure-only   F only   Claude     0.72  0.79  0.58  0.41  0.26
Figure-only   F only   GPT-5      0.73  0.80  0.60  0.43  0.28
Paper2Video   F + D    Gemini     0.66  0.77  0.54  0.31  0.39
Paper2Video   F + D    Claude     0.68  0.79  0.57  0.33  0.42
Paper2Video   F + D    GPT-5      0.69  0.81  0.59  0.36  0.45
MINARD        F + D    Gemini     0.74  0.80  0.75  0.70  0.74
MINARD        F + D    Claude     0.75  0.82  0.79  0.76  0.78
MINARD        F + D    GPT-5      0.76  0.83  0.84  0.81  0.82
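
The claim that MINARD leads every column can be checked directly against the numbers above. The following is a minimal sketch, assuming the comparison is made between methods that share a backbone. The five score columns carry no names in this table, so the code refers to them by position, and the helper names `best_method_per_column` and `wins_everywhere` are invented for this example.

```python
# Rows copied from the narration-quality table: (method, input, backbone, scores).
# The five score columns are unnamed in the source, so they are indexed by position.
ROWS = [
    ("Figure-only", "F only", "Gemini", (0.71, 0.78, 0.56, 0.39, 0.24)),
    ("Figure-only", "F only", "Claude", (0.72, 0.79, 0.58, 0.41, 0.26)),
    ("Figure-only", "F only", "GPT-5", (0.73, 0.80, 0.60, 0.43, 0.28)),
    ("Paper2Video", "F + D", "Gemini", (0.66, 0.77, 0.54, 0.31, 0.39)),
    ("Paper2Video", "F + D", "Claude", (0.68, 0.79, 0.57, 0.33, 0.42)),
    ("Paper2Video", "F + D", "GPT-5", (0.69, 0.81, 0.59, 0.36, 0.45)),
    ("MINARD", "F + D", "Gemini", (0.74, 0.80, 0.75, 0.70, 0.74)),
    ("MINARD", "F + D", "Claude", (0.75, 0.82, 0.79, 0.76, 0.78)),
    ("MINARD", "F + D", "GPT-5", (0.76, 0.83, 0.84, 0.81, 0.82)),
]


def best_method_per_column(rows, backbone):
    """Return the top-scoring method in each score column for one backbone."""
    subset = [row for row in rows if row[2] == backbone]
    n_cols = len(subset[0][3])
    return [max(subset, key=lambda row: row[3][col])[0] for col in range(n_cols)]


def wins_everywhere(rows, method):
    """True if `method` has the top score in every column for every backbone."""
    backbones = sorted({row[2] for row in rows})
    return all(
        winner == method
        for backbone in backbones
        for winner in best_method_per_column(rows, backbone)
    )


print(wins_everywhere(ROWS, "MINARD"))
```

Comparing within a backbone matters here: in the second score column, Paper2Video with GPT-5 (0.81) edges out MINARD with Gemini (0.80), so the claim holds per backbone rather than across all nine rows.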

Grounding F1 by Difficulty

Figure panels: F1 by difficulty for Ishani narration, step-by-step narration, and transcript narration.