CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering



Zhengqing Wang1     Yuefan Wu1     Jiacheng Chen1      Fuyang Zhang1      Yasutaka Furukawa1,2
1Simon Fraser University     2Wayve    
We represent a 3D scene as compressed light-field tokens (CLiFTs) that enable adaptive neural rendering with configurable compute budgets, providing flexible trade-offs between rendering speed, representation storage, and visual quality.

Abstract

This paper proposes a neural rendering approach that represents a scene as "compressed light-field tokens (CLiFTs)", retaining rich appearance and geometric information of the scene. CLiFT enables compute-efficient rendering via compressed tokens, while allowing the number of tokens used to represent a scene or render a novel view to be changed with a single trained network. Concretely, given a set of images, a multi-view encoder tokenizes the images together with their camera poses. Latent-space K-means then selects a reduced set of rays as cluster centroids based on the tokens. A multi-view "condenser" compresses the information of all the tokens into the centroid tokens to construct the CLiFTs. At test time, given a target view and a compute budget (i.e., the number of CLiFTs), the system collects the specified number of nearby tokens and synthesizes a novel view with a compute-adaptive renderer. Extensive experiments on the RealEstate10K and DL3DV datasets quantitatively and qualitatively validate our approach, which achieves significant data reduction with comparable rendering quality and the highest overall rendering score, while providing trade-offs among data size, rendering quality, and rendering speed.
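To make the test-time behavior concrete, below is a minimal sketch, assuming each CLiFT stores a feature vector plus the 3D position of the camera/ray it was condensed from. The function name select_clifts, the per-token positions, and the distance-based scoring are illustrative assumptions, not the released implementation.

# Minimal sketch of budget-constrained CLiFT collection (assumptions noted above).
import torch


def select_clifts(clift_feats, clift_positions, target_cam_pos, budget):
    """Collect the `budget` CLiFTs whose source positions lie closest to the
    target camera, i.e., "nearby tokens" under a compute budget.

    clift_feats:     (N, D) token features
    clift_positions: (N, 3) 3D position associated with each token (assumption)
    target_cam_pos:  (3,)   target camera center
    budget:          number of tokens the renderer may attend to
    """
    dists = torch.linalg.norm(clift_positions - target_cam_pos, dim=-1)  # (N,)
    k = min(budget, clift_feats.shape[0])
    idx = torch.topk(dists, k, largest=False).indices  # k nearest tokens
    return clift_feats[idx], idx


# Toy usage: 1024 stored CLiFTs, rendering with a budget of 256 tokens.
feats = torch.randn(1024, 768)
positions = torch.randn(1024, 3)
selected, idx = select_clifts(feats, positions, torch.tensor([0.0, 0.0, 2.0]), 256)
print(selected.shape)  # torch.Size([256, 768])

The selected tokens are what the compute-adaptive renderer attends to; a larger budget trades rendering speed for quality.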

Method Overview

We present a step-by-step visualization of CLiFT's test-time process, demonstrating how multi-view images are tokenized, clustered, and compressed into CLiFTs, followed by the adaptive rendering process that selects relevant CLiFTs based on the target viewpoint and compute budget.
We then illustrate the detailed method pipeline. Training (top): Multi-view images are processed through three stages: (1) a multi-view encoder tokenizes the input images with their camera poses, (2) latent K-means clusters the tokens to select representative centroids, and (3) neural condensation compresses the information of all tokens into the selected centroids to create CLiFTs. Inference (bottom): The multi-view input images are first encoded into CLiFTs by the same process as in training. Relevant CLiFTs are then collected based on the target viewpoint and compute budget, and passed to the adaptive renderer to produce the target view.
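As a concrete illustration of stage (2), here is a minimal sketch of latent-space K-means over the encoder tokens, followed by snapping each centroid to its nearest actual token. The function latent_kmeans_centroid_tokens, the iteration count, and the token layout are our assumptions for illustration, not the released code.

import torch


def latent_kmeans_centroid_tokens(tokens, k, iters=10):
    """tokens: (N, D) multi-view encoder tokens pooled over all input images.
    Returns indices of K tokens chosen as centroid tokens plus cluster assignments."""
    n = tokens.shape[0]
    centroids = tokens[torch.randperm(n)[:k]].clone()  # initialize from random tokens
    for _ in range(iters):
        assign = torch.cdist(tokens, centroids).argmin(dim=1)  # nearest centroid per token
        for c in range(k):
            members = tokens[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(dim=0)  # recompute cluster mean
    # Snap each centroid to the closest actual token; these centroid tokens later
    # receive the condensed information from the rest of their cluster.
    centroid_idx = torch.cdist(centroids, tokens).argmin(dim=1)  # (K,)
    return centroid_idx, assign


# Toy usage: 4 input views x 576 tokens each, keeping 128 centroid tokens.
tokens = torch.randn(4 * 576, 768)
centroid_idx, assign = latent_kmeans_centroid_tokens(tokens, k=128)
print(centroid_idx.shape, assign.shape)  # torch.Size([128]) torch.Size([2304])

Clustering in latent space rather than pixel space lets a cluster group similar content across views, which is what the condensation stage exploits.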

Results

Comparison with LVSM on RealEstate10K

Comparison with DepthSplat on DL3DV

Detailed Comparison

We show detailed frame-level comparisons between DepthSplat and CLiFT at different compression ratios.

More Qualitative Results on RealEstate10K

More Qualitative Results on DL3DV

Visualization of Latent K-means Clustering

Finally, we visualize the latent K-means clustering results, using K=128 for clarity. Each color represents a cluster, and a yellow ring marks the centroid token. Note that clustering is performed across multiple views, so a single cluster can span multiple images; as a result, some clusters may not have a visible centroid in a given image.
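For readers who want to reproduce this style of figure, below is a minimal sketch, assuming the tokens form a regular patch grid per view. The grid size, view ordering, and the function draw_cluster_map are illustrative assumptions, not the paper's plotting code.

import numpy as np
import matplotlib.pyplot as plt


def draw_cluster_map(assign, centroid_idx, n_views=4, grid=(24, 24)):
    """assign: (N,) cluster id per token, ordered view-major then row-major.
    centroid_idx: flat indices of the tokens chosen as centroids."""
    h, w = grid
    palette = np.random.default_rng(0).random((int(assign.max()) + 1, 3))  # one color per cluster
    fig, axes = plt.subplots(1, n_views, figsize=(3 * n_views, 3))
    for v, ax in enumerate(np.atleast_1d(axes)):
        ids = assign[v * h * w:(v + 1) * h * w].reshape(h, w)
        ax.imshow(palette[ids], interpolation="nearest")
        for ci in centroid_idx:  # yellow ring on centroid tokens in this view
            if v * h * w <= ci < (v + 1) * h * w:
                r, c = divmod(int(ci) - v * h * w, w)
                ax.scatter([c], [r], s=60, facecolors="none", edgecolors="yellow")
        ax.set_axis_off()
    plt.show()


# Toy usage with random assignments: 4 views of 24x24 tokens, 128 clusters.
assign = np.random.default_rng(1).integers(0, 128, size=4 * 24 * 24)
centroid_idx = np.random.default_rng(2).choice(4 * 24 * 24, size=128, replace=False)
draw_cluster_map(assign, centroid_idx)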

BibTeX

@article{Wang2025CLiFT,
  author    = {Wang, Zhengqing and Wu, Yuefan and Chen, Jiacheng and Zhang, Fuyang and Furukawa, Yasutaka},
  title     = {CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering},
  journal   = {arXiv preprint arXiv:2507.08776},
  year      = {2025},
}

Acknowledgements

We thank Haian Jin for helpful discussions on reproducing LVSM and training on the DL3DV dataset. This research is partially supported by NSERC Discovery Grants, NSERC Alliance Grants, and John R. Evans Leaders Fund (JELF). We thank the Digital Research Alliance of Canada and BC DRI Group for providing computational resources.
