MVInverse: Feed-forward Multi-view Inverse Rendering in Seconds

1Tsinghua University, 2The Hong Kong University of Science and Technology

We present MVInverse, a feed-forward multi-view inverse rendering method that recovers consistent intrinsic images in seconds.

Abstract

Multi-view inverse rendering aims to recover geometry, materials, and illumination consistently across multiple viewpoints. When applied to multi-view images, existing single-view approaches process each view independently and ignore cross-view relationships, leading to inconsistent results. In contrast, multi-view optimization methods rely on slow differentiable rendering and per-scene refinement, making them computationally expensive and hard to scale. To address these limitations, we introduce a feed-forward multi-view inverse rendering framework that directly predicts spatially varying albedo, metallic, roughness, diffuse shading, and surface normals from sequences of RGB images. By alternating attention across views, our model captures both intra-view long-range lighting interactions and inter-view material consistency, enabling coherent scene-level reasoning within a single forward pass. Due to the scarcity of real-world training data, models trained on existing synthetic datasets often struggle to generalize to real-world scenes. To overcome this limitation, we propose a consistency-based finetuning strategy that leverages unlabeled real-world videos to enhance both multi-view coherence and robustness under in-the-wild conditions. Extensive experiments on benchmark datasets demonstrate that our method achieves state-of-the-art performance in terms of multi-view consistency, material and normal estimation quality, and generalization to real-world imagery.
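As a rough illustration of the consistency-based finetuning idea (not the paper's exact objective), one can penalize disagreements between intrinsic maps predicted for two frames of an unlabeled video after warping one prediction into the other view. In the sketch below, the flow-based warp, the dictionary of intrinsic maps, and the loss weighting are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def consistency_loss(pred_a, pred_b, flow_ab, valid_mask):
    """Cross-view consistency on unlabeled video frames (illustrative sketch).

    pred_a, pred_b: dicts of intrinsic maps (e.g. 'albedo', 'roughness'), each (B, C, H, W).
    flow_ab:        correspondence field mapping pixels of view A to view B, (B, 2, H, W),
                    given in pixel offsets (assumed to come from an off-the-shelf matcher).
    valid_mask:     (B, 1, H, W) mask of pixels with reliable correspondences.
    """
    B, _, H, W = flow_ab.shape
    # Build a normalized sampling grid in [-1, 1] from the assumed flow field.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(flow_ab.device)   # (2, H, W), x then y
    coords = base.unsqueeze(0) + flow_ab                              # (B, 2, H, W)
    grid = torch.stack((2 * coords[:, 0] / (W - 1) - 1,
                        2 * coords[:, 1] / (H - 1) - 1), dim=-1)      # (B, H, W, 2)

    loss = 0.0
    for key in pred_a:
        # Warp view B's prediction into view A and compare where matches are valid.
        warped_b = F.grid_sample(pred_b[key], grid, align_corners=True)
        loss = loss + (valid_mask * (pred_a[key] - warped_b).abs()).mean()
    return loss
```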

Framework

Framework diagram

Given an input image sequence, our framework first encodes each frame into patch tokens via DINOv2, where alternating global–frame attention enables cross-view feature aggregation. Meanwhile, a frame-wise ResNeXt encoder provides multi-resolution convolutional features to preserve fine spatial details. The two feature streams are fused in a DPT-style prediction head to produce pixel-aligned intrinsic maps, including albedo, metallic, roughness, normal, and shading. The diffuse image is obtained as the product of albedo and diffuse shading.
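For a concrete picture of the alternating global–frame attention, the sketch below shows one way such a block could be implemented in PyTorch. The module names, dimensions, and token reshaping are illustrative assumptions rather than the exact implementation; DINOv2 patch tokens of shape (B, V, N, C) are assumed, where V is the number of views and N the number of patches per frame.

```python
import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    """One frame-attention layer followed by one global-attention layer (sketch)."""
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, tokens):
        # tokens: (B, V, N, C) patch tokens for V views, N patches per view.
        B, V, N, C = tokens.shape

        # Frame attention: each view attends only to its own patches.
        x = tokens.reshape(B * V, N, C)
        h = self.norm1(x)
        x = x + self.frame_attn(h, h, h)[0]

        # Global attention: patches from all views attend to each other,
        # which propagates lighting and material cues across views.
        x = x.reshape(B, V * N, C)
        h = self.norm2(x)
        x = x + self.global_attn(h, h, h)[0]
        return x.reshape(B, V, N, C)
```

Alternating the two keeps per-view detail reasoning (frame attention) separate from cross-view reasoning (global attention), mirroring the intra-view and inter-view interactions described in the abstract.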

Motivation

Motivation diagram

Given the red and blue query patches in the first view, we visualize the corresponding attention heatmaps in the second view (bottom row) to illustrate the effectiveness of our model design. Red: the model captures long-range lighting interactions across spatially distant regions. Blue: the model maintains cross-view consistency by correctly associating corresponding surface regions under viewpoint changes.
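In principle, heatmaps of this kind can be read directly off the global-attention weights between a query patch in the first view and all patches of the second view. The snippet below is a hedged sketch of that readout; it assumes a single example's attention matrix of shape (V*N, V*N), already averaged over heads, and a known patch grid, both of which are assumptions for illustration.

```python
import torch

def cross_view_heatmap(attn_weights, query_idx, n_patches, grid_hw,
                       src_view=0, tgt_view=1):
    """Visualize how a query patch in src_view attends to patches in tgt_view.

    attn_weights: (V*N, V*N) global-attention weights averaged over heads (assumed layout).
    query_idx:    index of the query patch within src_view (0 <= query_idx < n_patches).
    grid_hw:      (H, W) patch grid of one frame, with H * W == n_patches.
    """
    row = src_view * n_patches + query_idx                 # global index of the query token
    cols = slice(tgt_view * n_patches, (tgt_view + 1) * n_patches)
    heat = attn_weights[row, cols]                          # attention mass on the target view
    heat = heat / (heat.max() + 1e-8)                       # normalize for display
    return heat.reshape(*grid_hw)                           # (H, W) map, ready for imshow
```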

Single-view Results

Single-view qualitative comparison figure

Multi-view Results

Results on Mip-NeRF 360 Dataset

Mip-NeRF 360 results figures

Results on Tanks and Temples Dataset

Tanks and Temples results figures

Results on DL3DV Dataset

DL3DV results figures

Ablation

Ablation study figures