Given an input image sequence, our framework first encodes each frame into patch tokens via DINOv2, where alternating global–frame attention enables cross-view feature aggregation. Meanwhile, a frame-wise ResNeXt encoder provides multi-resolution convolutional features to preserve fine spatial details. The two feature streams are fused in a DPT-style prediction head to produce pixel-aligned intrinsic maps, including albedo, metallic, roughness, normal, and shading. The diffuse image is obtained as the product of albedo and diffuse shading.
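To make the dataflow concrete, below is a minimal PyTorch sketch of this pipeline. All names (`GlobalFrameBlock`, `IntrinsicNet`), layer sizes, and activation choices are hypothetical stand-ins: the single conv layers merely mimic the pretrained DINOv2/ResNeXt backbones and the DPT head, and only the overall structure (alternating frame-wise/global attention, two-stream fusion, albedo × shading reconstruction) follows the description above.

```python
# Hypothetical sketch of the described forward pass; not the actual model.
import torch
import torch.nn as nn

class GlobalFrameBlock(nn.Module):
    """One alternating attention pair: frame-wise self-attention over each
    frame's patches, then global attention over all frames' patches."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):            # x: (F, P, C) = frames, patches, channels
        f, p, c = x.shape
        x = x + self.frame_attn(x, x, x)[0]      # per-frame attention
        g = x.reshape(1, f * p, c)               # flatten all views together
        g = g + self.global_attn(g, g, g)[0]     # cross-view attention
        return g.reshape(f, p, c)

class IntrinsicNet(nn.Module):
    def __init__(self, dim=128, patch=14, depth=2):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, patch, stride=patch)  # stand-in for DINOv2 patchify
        self.blocks = nn.ModuleList(GlobalFrameBlock(dim) for _ in range(depth))
        self.conv = nn.Sequential(                           # stand-in for ResNeXt stream
            nn.Conv2d(3, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1))
        # DPT-style head collapsed to one conv: 3 (albedo) + 1 (metallic)
        # + 1 (roughness) + 3 (normal) + 3 (shading) = 11 output channels
        self.head = nn.Conv2d(2 * dim, 11, 1)

    def forward(self, frames):       # frames: (F, 3, H, W)
        f, _, h, w = frames.shape
        tok = self.embed(frames)                     # (F, C, H/p, W/p)
        gh, gw = tok.shape[-2:]
        tok = tok.flatten(2).transpose(1, 2)         # (F, P, C)
        for blk in self.blocks:
            tok = blk(tok)
        tok = tok.transpose(1, 2).reshape(f, -1, gh, gw)
        tok = nn.functional.interpolate(tok, (h, w), mode="bilinear")
        fused = torch.cat([tok, self.conv(frames)], dim=1)   # two-stream fusion
        out = self.head(fused)
        albedo, metallic, roughness, normal, shading = out.split([3, 1, 1, 3, 3], dim=1)
        diffuse = albedo.sigmoid() * shading.relu()  # diffuse = albedo * diffuse shading
        return dict(albedo=albedo, metallic=metallic, roughness=roughness,
                    normal=normal, shading=shading, diffuse=diffuse)

# Example: two 224x224 frames in, pixel-aligned intrinsic maps out.
maps = IntrinsicNet()(torch.randn(2, 3, 224, 224))
print({k: tuple(v.shape) for k, v in maps.items()})
```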
Given the red and blue query patches in the first view, we visualize the corresponding attention heatmaps in the second view (bottom row) to illustrate the effectiveness of our model design. Red query: the model captures long-range lighting interactions across spatially distant regions. Blue query: the model maintains cross-view consistency by correctly associating corresponding surface regions under viewpoint changes.
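As a sketch of how such a heatmap can be read out, the snippet below recomputes a plain scaled dot-product attention between one query token and all tokens of another view, then upsamples it to image resolution. The function name and its arguments are hypothetical, and the recomputed weights only approximate the model's learned multi-head attention.

```python
# Hypothetical sketch of a cross-view attention heatmap; not the paper's code.
import torch
import torch.nn.functional as F

def cross_view_heatmap(tokens, query_view, query_idx, target_view, grid, size):
    """tokens: (F, P, C) tokens after the global attention layers. Returns an
    (H, W) map of where the query patch attends within the target view."""
    q = tokens[query_view, query_idx]                    # query token (C,)
    k = tokens[target_view]                              # target-view keys (P, C)
    attn = F.softmax(k @ q / q.shape[-1] ** 0.5, dim=0)  # weights over P patches
    heat = attn.reshape(1, 1, *grid)                     # back to the patch grid
    return F.interpolate(heat, size, mode="bilinear")[0, 0]

# Example: query at grid cell (row 3, col 7) of view 0, heatmap over view 1,
# using random tokens in place of the model's global-attention features.
tokens = torch.randn(2, 16 * 16, 384)
heat = cross_view_heatmap(tokens, query_view=0, query_idx=3 * 16 + 7,
                          target_view=1, grid=(16, 16), size=(224, 224))
```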