Consistent Direct Time-of-Flight Video Depth Super-Resolution

CVPR 2023

Stanford University, Meta Reality Labs

Update 04/30/2023: We open-sourced our code and dataset!



Video comparisons between a state-of-the-art (SOTA) per-frame processing algorithm and the proposed depth video super-resolution (DVSR) and histogram video super-resolution (HVSR) solutions. DVSR improves depth prediction accuracy and temporal stability over the baseline, while HVSR further improves fine details.




Direct time-of-flight (dToF) sensors are promising for next-generation on-device 3D sensing. However, to achieve a sufficient signal-to-noise ratio (SNR) in a compact module, the dToF data has limited spatial resolution (e.g., ~20x30 for iPhone dToF), and it requires a super-resolution step before being passed to downstream tasks. In this paper, we solve this super-resolution problem by fusing the low-resolution dToF data with the corresponding high-resolution RGB guidance. Unlike conventional RGB-guided depth enhancement approaches, which perform the fusion per frame, we propose the first multi-frame fusion scheme to mitigate the spatial ambiguity resulting from low-resolution dToF imaging. In addition, dToF sensors provide unique depth histogram information for each local patch, and we incorporate this dToF-specific feature in our network design to further alleviate spatial ambiguity. To evaluate our models on complex dynamic indoor environments and to provide a large-scale dToF sensor dataset, we introduce DyDToF, the first synthetic RGB-dToF video dataset that features dynamic objects and a realistic dToF simulator following the physical imaging process. We believe the methods and dataset are beneficial to a broad community, as dToF depth sensing is becoming mainstream on mobile devices.
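To make the per-patch histogram idea concrete, here is a minimal Python sketch that bins a high-resolution depth map into one histogram per low-resolution dToF pixel. It is an illustration only and not the simulator or network input pipeline from the paper: the function names `patch_depth_histograms` and `peak_depth`, the square patch layout, and the uniform binning up to `max_depth` are all assumptions, and physical effects such as surface reflectance, distance falloff, and ambient noise are ignored.

```python
import numpy as np


def patch_depth_histograms(depth_hr, patch, num_bins, max_depth):
    """Count depth values per patch (hypothetical helper, simplified model).

    depth_hr  : (H, W) high-resolution depth map, in meters
    patch     : side length of the square area seen by one dToF pixel
    num_bins  : number of uniform depth bins in [0, max_depth)
    Returns an integer array of shape (H // patch, W // patch, num_bins).
    """
    h, w = depth_hr.shape
    if h % patch or w % patch:
        raise ValueError("H and W must be multiples of patch")
    rows, cols = h // patch, w // patch

    # Regroup pixels so the last axis holds all depths inside one patch.
    blocks = (
        depth_hr.reshape(rows, patch, cols, patch)
        .transpose(0, 2, 1, 3)
        .reshape(rows, cols, patch * patch)
    )

    # Map each depth to a bin index; out-of-range depths go to the edge bins.
    bin_idx = np.clip(
        (blocks / max_depth * num_bins).astype(int), 0, num_bins - 1
    )

    hist = np.zeros((rows, cols, num_bins), dtype=np.int64)
    r, c = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    # Unbuffered accumulation: one count per high-resolution pixel.
    np.add.at(hist, (r[..., None], c[..., None], bin_idx), 1)
    return hist


def peak_depth(hist, max_depth):
    """Reduce each histogram to the center of its most populated bin."""
    num_bins = hist.shape[-1]
    bin_centers = (np.arange(num_bins) + 0.5) * max_depth / num_bins
    return bin_centers[hist.argmax(axis=-1)]
```

In this toy model, a patch that straddles a depth edge produces a histogram with two populated bins, while `peak_depth` keeps only the dominant surface. That lost information is one way to picture the spatial ambiguity that the histogram input is meant to reduce.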



DyDToF Dataset


We introduce the DyDToF dataset, the first synthetic RGB-dToF video dataset that features dynamic objects and a realistic dToF simulator.



Real-World Comparisons with Apple ARKit


Since Apple ARKit does not provide raw dToF data, we naively downsample the pre-processed dToF data and use it as input to our DVSR network. Our DVSR network generalizes well to this task even without finetuning. Compared to the high-resolution processing results provided by ARKit, DVSR achieves better temporal consistency and corrects errors in the ARKit pre-processed data. This further demonstrates the effectiveness of the multi-frame fusion approach.
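The exact downsampling used for the ARKit data is not specified here. As one plausible reading of "naively downsample", the sketch below block-averages a depth map by an integer factor while skipping invalid pixels. The function name `naive_downsample`, the choice of averaging, and the convention that zero or non-finite values mark invalid depth are assumptions made for illustration.

```python
import numpy as np


def naive_downsample(depth_hr, factor):
    """Block-average a depth map by an integer factor (hypothetical helper).

    Pixels that are zero or non-finite are treated as invalid (an assumed
    convention) and excluded from the average. Blocks with no valid pixel
    are returned as 0.
    """
    h, w = depth_hr.shape
    rows, cols = h // factor, w // factor

    # Crop so both dimensions are exact multiples of the factor.
    d = depth_hr[: rows * factor, : cols * factor].astype(np.float64)
    blocks = d.reshape(rows, factor, cols, factor)

    valid = np.isfinite(blocks) & (blocks > 0)
    total = np.where(valid, blocks, 0.0).sum(axis=(1, 3))
    count = valid.sum(axis=(1, 3))

    # Divide only where at least one valid pixel contributed.
    return np.divide(total, count, out=np.zeros_like(total), where=count > 0)
```

For example, `naive_downsample(depth, 8)` applied to a 192x256 depth map (an illustrative size) would return a 24x32 map, in the same range as the ~20x30 dToF resolution mentioned above.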





  title={Consistent Direct Time-of-Flight Video Depth Super-Resolution},
  author={Sun, Zhanghao and Ye, Wei and Xiong, Jinhui and Choe, Gyeongmin and Wang, Jialiang and Su, Shuochen and Ranjan, Rakesh},
  journal={arXiv preprint arXiv:2211.08658},